https://rpubs.com/staszkiewicz/CCR_rep_EN

Data source

https://www.dropbox.com/scl/fo/sajrtchhqeftivhd4zwqu/h?rlkey=u37l184e2khoxfufulsyq00kc&dl=0

Copy and paste the above link into your browser and download the CSV data file

“Dobre_dane_Regresja_czyste_8_Wrzesnia_2015gdt.csv” to replicate the code below. Click the “code” button (in the upper right corner) to see the original R code.

Introduction

In systematic review methodology, the inclusion or exclusion of literature items is a major area of researcher discretion, and thus a methodological weakness, because we lack objective and complete criteria for including or excluding items. In practice, various filtering criteria are used, e.g. journal ranking, citation count, language, type of contribution (e.g. only articles, no monographs), time period, etc. This is because the initially identified population can run to several thousand items, while a manageable literature review covers up to about 100 items. Typically, the PRISMA protocol is used, i.e.

Source: Liberati et al. (2009).

The oval in the figure marks the stage of the PRISMA flow at which the general population of records is reduced, according to Liberati et al. (2009).

The idea behind citation count regression is that you do not have to specify the filtering mechanism arbitrarily: it can be done algorithmically. Alternatively, CCR can be used in a robustness section to show the stability of the study’s results.

Let’s imagine that we have already identified the entire population of articles on an issue of interest (e.g., non-audit components of auditors’ fees). There may be a vast number of these items; some are important and recognized, while others are marginal additions to the mainstream discussion. An expert in auditing who read all of this literature could select the important items for the review. Such a discretionary choice rests on the prestige, knowledge and competence of whoever makes the selection. However, two people with similar levels of knowledge and competence can make different choices - this is called the researcher’s bias. How, then, can the selection be made objective? PRISMA, in its essence, does not answer this question; it only recommends reproducibility, i.e. explaining the choice. CCR (citation count regression) is built on a different philosophy: it attempts to map the entire research area and select those items that either most affect the mainstream or define the frontier of knowledge.

CCR

CCR is a regression, where the dependent variable is the annual number of citations of the paper, and the independent variables are derived from bibliometric studies of factors affecting citations. CCR assumes that the number of citations indicates the scientific quality of a paper, and it is a stimulus (i.e., more citations equals better quality). This assumption is quite naive and heavily criticized, but the instrument itself is quite reasonable.

\[ y = aX + b \] where, in our example, \(y\) is the annual number of citations, \(X\) is a vector of independent variables, \(a\) a vector of coefficients, and \(b\) the intercept.

The vector \(X\) can consist of any logical set of variables typical of the domain. Without exploring the topic further, let’s consider the following variables affecting the number of citations: \(Years\), the number of years since the article was published; \(BigSample\), a binary variable equal to 1 if the article uses a sample of more than 1000 items; \(Method\), a binary variable equal to 1 for regression studies; \(AngloSaxon\), a binary variable equal to 1 if the market discussed in the article is the US, Australia, Singapore, the UK or New Zealand; \(USA\), a binary variable equal to 1 if the study concerns the US; \(TimeSpan\), the time range of the sample analyzed in the article (number of years); \(SampleSize\), the number of items in the sample; and \(BusinessSupport\), which takes the value 1 when the study was externally funded by business entities. A coding sketch follows below.
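
As a hedged illustration (the raw columns Country and N_obs below are hypothetical placeholders, not fields of any particular database export), the binary variables could be coded from the metadata like this:

# a minimal sketch of coding CCR dummy variables from bibliographic metadata
meta <- data.frame(
  Country = c("USA", "Germany", "UK"),   # hypothetical market field
  N_obs   = c(1500, 300, 2400)           # hypothetical sample size of each study
)
anglo <- c("USA", "UK", "Australia", "Singapore", "New Zealand")
meta$BigSample  <- as.integer(meta$N_obs > 1000)
meta$AngloSaxon <- as.integer(meta$Country %in% anglo)
meta$USA        <- as.integer(meta$Country == "USA")
meta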

How CCR works

Let’s imagine that we have a collection of articles imported from an abstract database (e.g. WoS or Scopus); the variables for the regression are built from metadata, i.e. information found in the bibliographic description or abstract of each item.

If we draw a scatter plot, i.e. the number of citations versus the independent variables, we get either an unstructured random cloud (Panel A) or a set of points resembling some curve (Panel B). In both cases we can fit a linear regression (Panels C and D), but the fitting errors can be significant (interestingly, the method is robust to them).

We can interpret the blue regression line in Panels C and D as the main current of the analyzed literature. Most of the papers lie close to the regression line (in terms of the independent variables); papers far from it either concern side areas or new perspectives, or represent substantively unsuccessful papers over which other researchers have mercifully drawn the curtain of silence.

From there it is a short step to the idea of CCR. Let’s add to our list one more paper, represented by the red dot in Panels E and F. Recalculating the regression and plotting it again, we get a new regression line (the red line in Panels E and F). This one paper affects the mainstream by deflecting it slightly (Panel F) or significantly changing its direction (Panel E). The change is nothing but \(\Delta\), i.e. the change in the value of the slope coefficient of the regression after taking the new article into account. Thus our literature review largely consists of articles inside the mainstream plus those outside the mainstream that significantly affect it.

library(ggplot2)
# generate toy data

y <- runif(100, min = 0, max = 50)   # yearly citations
z <- runif(100, min = 1, max = 12)   # an unrelated predictor (random cloud)
x <- 2 * y + z                       # a predictor correlated with y

model <- lm(y ~ x)

# ggplot data frames (keep y inside each frame so ggplot does not
# have to look it up in the global environment)

df1 <- data.frame(x, y, z)
df  <- data.frame(x, y)
df2 <- rbind(df, c(75, 0.5))         # append the "new paper" (x = 75, y = 0.5)
df3 <- rbind(df1, c(75, 0.5, 0.5))   # the same point for the z-based panel


plot1 <- ggplot(df1, aes(z, y)) +
  geom_point() +
  theme_bw() +
  labs(x = 'Independent variables', y = 'Yearly number of citations') +
  ggtitle("Panel A")

plot2 <- ggplot(df, aes(x, y)) +
  geom_point() +
  theme_bw() +
  labs(x = 'Independent variables', y = 'Yearly number of citations') +
  ggtitle("Panel B")


plot1a <- ggplot(df1, aes(z, y)) +
  geom_point() +
  geom_smooth(method = lm) +
  theme_bw() +
  labs(x = 'Independent variables', y = 'Yearly number of citations') +
  ggtitle("Panel C")

plot2a <- ggplot(df, aes(x, y)) +
  geom_point() +
  geom_smooth(method = lm) +
  theme_bw() +
  labs(x = 'Independent variables', y = 'Yearly number of citations') +
  ggtitle("Panel D")




# extract the parameters of the original model (before adding the point)
coeff  <- coefficients(model)
wolny  <- coeff[1]   # intercept
nachyl <- coeff[2]   # slope

plot3 <- ggplot(df3, aes(x, z)) +
  geom_point() +
  geom_smooth(method = lm) +
  theme_bw() +
  labs(x = 'Independent variables', y = 'Yearly number of citations') +
  annotate("point", x = 75, y = 0.5, colour = "red") +   # mark the added paper
  geom_abline(intercept = wolny, slope = nachyl, color = "red",
              linetype = "solid", size = 1.5) +
  ggtitle("Panel E")

plot4 <- ggplot(df2, aes(x, y)) +
  geom_point() +
  geom_smooth(method = lm) +
  theme_bw() +
  labs(x = 'Independent variables', y = 'Yearly number of citations') +
  annotate("point", x = 75, y = 0.5, colour = "red") +   # mark the added paper
  geom_abline(intercept = wolny, slope = 0.2, color = "red",
              linetype = "solid", size = 1.5) +
  ggtitle("Panel F")



library(gridExtra)

grid.arrange(plot1, plot2, plot1a, plot2a, plot3, plot4, ncol = 2, nrow = 3)
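
To make \(\Delta\) concrete, here is a small sketch reusing the simulated df and df2 objects from above: fit the regression before and after adding the red point and take the difference of the slopes.

# slope before and after adding the extra point (the red dot)
slope_before <- coef(lm(y ~ x, data = df))[["x"]]
slope_after  <- coef(lm(y ~ x, data = df2))[["x"]]
slope_after - slope_before   # this difference is Delta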

Our task therefore boils down to: removing one point (article) at a time, recalculating the slope coefficient of the regression for all articles and for all but that one, computing the difference between the slope coefficients, and identifying the differences that are statistically significant (there is a test that shows us this). Without going into the technique itself, the method allows us to reduce the original population roughly tenfold, to a sample of the articles that are significant.
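
This leave-one-out recalculation need not be coded by hand: base R’s dfbeta() returns, for each observation, the change in each coefficient when that observation is deleted, so its slope column is exactly the vector of \(\Delta\) values. A minimal sketch on the toy model from above:

# change in the coefficients when each observation is left out in turn
d <- dfbeta(lm(y ~ x, data = df))   # one row per observation, one column per coefficient
head(d[, "x"])                      # the slope Delta for the first few papers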

Since the method is abstract and algorithmic, it removes the problem of researcher bias; however, we have no assurance that a fundamental work that sits inside the mainstream will not be missed. This feature is its weakness, so in practice we use a combination of algorithmic and discretionary methods. For methodological details see Staszkiewicz (2019).

Interestingly, the presented algorithm is suitable both for selecting articles and for checking whether a given literature review covers all relevant items.

Example of replication

Proceedings in brief

Step 1 We load the data and calculate the regression

Step 2 We identify high-leverage items

Step 3 We determine a sample of articles for further narrative description

Step 4 We evaluate the stability of the results and the characteristics of the sample

Step 5 We discuss the technical issues of importing data from WoS

Step 1 Data

We should prepare metadata about our articles; the definitions of the variables are included in the table below:

Source: Staszkiewicz (2019).

Let’s load the data into the DF object. We can prepare the data in Excel. We will discuss data import in Step 5.

library(readr)
DF <- read_csv(file.choose())
## Rows: 67 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (13): PublicationYears, BigSample, Method, AngloSaxon, USA, TimeSpan, Sa...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# convert to a plain data frame and set row names to the paper index
DF <- as.data.frame(DF)
rownames(DF) <- DF$Nr_Papieru

To load the data into the DF object, we used the readr library and the file.choose() function, which lets us pick the data file from a dialog window (this can be automated as well).
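
If you prefer a fully scripted import instead of the file dialog, a hedged alternative (assuming the downloaded CSV sits in your working directory) is:

# assuming the file is in the working directory; adjust the path if needed
DF <- readr::read_csv("Dobre_dane_Regresja_czyste_8_Wrzesnia_2015gdt.csv",
                      show_col_types = FALSE)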

We can view the data with View(DF) or head(DF), and list all the variables available in our database with colnames(DF).

head(DF)
##   PublicationYears BigSample Method AngloSaxon USA TimeSpan Sample
## 2               35         0      0          1   1        0    243
## 3               33         0      0          1   1        0    410
## 5               22         0      1          1   1        1     98
## 6               22         0      1          1   1        3    860
## 7               20         0      0          0   0        0    287
## 8               16         0      0          0   0        0   2428
##   BusinessSupport TC    TCYear EXCLUDED   PY Nr_Papieru
## 2               0  0 0.0000000        0 1980          2
## 3               0  4 0.1212121        0 1982          3
## 5               0 49 2.2272727        0 1993          5
## 6               0 35 1.5909091        0 1993          6
## 7               0  6 0.3000000        0 1995          7
## 8               0  7 0.4375000        0 1999          8
colnames(DF)
##  [1] "PublicationYears" "BigSample"        "Method"           "AngloSaxon"      
##  [5] "USA"              "TimeSpan"         "Sample"           "BusinessSupport" 
##  [9] "TC"               "TCYear"           "EXCLUDED"         "PY"              
## [13] "Nr_Papieru"

Column no. 13 (Nr_Papieru) is the index of our papers from the database.

Let’s see the structure of the data - descriptive statistics. For this we will use the psych library and the describe function (this can be done in different ways, of course).

# load the library
# install.packages("psych") # if not already installed
library(psych) 
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
# descriptive statistics
describe(DF)
##                  vars  n    mean      sd median trimmed    mad  min      max
## PublicationYears    1 67    8.52    8.16    7.0    7.05   5.93    0    36.00
## BigSample           2 67    0.27    0.45    0.0    0.22   0.00    0     1.00
## Method              3 67    0.70    0.46    1.0    0.75   0.00    0     1.00
## AngloSaxon          4 67    0.69    0.47    1.0    0.73   0.00    0     1.00
## USA                 5 67    0.45    0.50    0.0    0.44   0.00    0     1.00
## TimeSpan            6 67    3.87    7.57    1.0    2.20   1.48    0    40.00
## Sample              7 67 1755.17 6348.16  371.0  779.31 413.65    0 51755.00
## BusinessSupport     8 67    0.04    0.21    0.0    0.00   0.00    0     1.00
## TC                  9 67   22.52   46.55    3.0   10.87   4.45    0   227.00
## TCYear             10 67    2.17    3.95    0.5    1.19   0.74    0    18.92
## EXCLUDED           11 67    0.00    0.00    0.0    0.00   0.00    0     0.00
## PY                 12 67 2006.48    8.16 2008.0 2007.95   5.93 1979  2015.00
## Nr_Papieru         13 67   64.72   44.95   55.0   64.07  65.23    1   137.00
##                     range  skew kurtosis     se
## PublicationYears    36.00  1.80     3.10   1.00
## BigSample            1.00  1.02    -0.97   0.05
## Method               1.00 -0.86    -1.28   0.06
## AngloSaxon           1.00 -0.79    -1.40   0.06
## USA                  1.00  0.21    -1.99   0.06
## TimeSpan            40.00  3.67    13.88   0.92
## Sample           51755.00  7.29    54.45 775.55
## BusinessSupport      1.00  4.30    16.78   0.03
## TC                 227.00  2.65     6.79   5.69
## TCYear              18.92  2.48     5.70   0.48
## EXCLUDED             0.00   NaN      NaN   0.00
## PY                  36.00 -1.80     3.10   1.00
## Nr_Papieru         136.00  0.13    -1.49   5.49

Let’s compare to the original:

Source:Staszkiewicz (2019).

As you can see, our data differ slightly from the original in the text (due to the exclusion of one paper: here 67 observations, in the original paper 68 observations).

Let’s calculate the regression with this equation:

\[ TC = \beta_0 + \beta_1 PublicationYears + \beta_2 BigSample + \beta_3 Method + \beta_4 AngloSaxon + \beta_5 TimeSpan + \beta_6 Sample + \beta_7 BusinessSupport + \beta_8 USA + \varepsilon \] Source: Staszkiewicz (2019).

For simplicity, we will estimate a reduced specification (without the \(USA\) dummy):

TCY<- lm(TCYear ~ PublicationYears +BigSample+ Method+AngloSaxon+TimeSpan +Sample+BusinessSupport,DF)

Running this code creates, in our case, the object TCY, which stores the fitted linear regression model. Let us inspect it with summary(TCY).

summary(TCY)
## 
## Call:
## lm(formula = TCYear ~ PublicationYears + BigSample + Method + 
##     AngloSaxon + TimeSpan + Sample + BusinessSupport, data = DF)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.6610 -1.4347 -0.4097  0.3778 13.8215 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)   
## (Intercept)      -1.085e+00  1.206e+00  -0.900   0.3718   
## PublicationYears  1.019e-01  5.645e-02   1.806   0.0760 . 
## BigSample         3.583e+00  1.129e+00   3.173   0.0024 **
## Method            6.020e-01  1.069e+00   0.563   0.5755   
## AngloSaxon        8.287e-01  1.027e+00   0.807   0.4231   
## TimeSpan          1.106e-01  6.237e-02   1.773   0.0815 . 
## Sample           -5.256e-05  7.557e-05  -0.696   0.4895   
## BusinessSupport   2.185e+00  2.184e+00   1.001   0.3212   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.633 on 59 degrees of freedom
## Multiple R-squared:  0.2427, Adjusted R-squared:  0.1529 
## F-statistic: 2.701 on 7 and 59 DF,  p-value: 0.01693

Our model is equivalent to the M4 model from the paper (circled in red); the differences are due to the missing observation and to the heteroskedasticity correction applied in the original calculation, which we do not need to repeat here.
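
If you want to reproduce the heteroskedasticity correction as well, a hedged sketch using the sandwich and lmtest packages (the exact estimator used in the original paper may differ) is:

# install.packages(c("sandwich", "lmtest")) # if not installed
library(sandwich)
library(lmtest)
coeftest(TCY, vcov = vcovHC(TCY, type = "HC1")) # White-type robust standard errors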

Note that the model has a low adjusted \(R^2\), so its predictive value is quite weak; of the predictors, \(BigSample\) is clearly significant, while \(PublicationYears\) and \(TimeSpan\) are only marginally so. Neither the significance of the variables nor the fit of the model is necessarily relevant for what follows. Why? I refer those interested to the methodological details.

For now, let’s view the estimation diagnostics graphically by applying the plot() function.

plot(TCY)

Step 2 Leverage observations

An observation is considered to have high leverage if its value (or values) for the predictor variables are significantly more extreme compared to the other observations in the dataset.
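
For reference, the leverage of observation \(i\) is the \(i\)-th diagonal element of the hat matrix \(H = X(X^\top X)^{-1}X^\top\):

\[ h_i = \mathbf{x}_i^\top (X^\top X)^{-1} \mathbf{x}_i \]

The \(h_i\) values sum to the number of estimated parameters \(k\), so their average is \(k/n\); this is why typical cut-offs are multiples of \(k/n\).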

The hatvalues() function calculates the leverage of each observation in the model.

# compute the leverage values
hats <- as.data.frame(hatvalues(TCY))

# list the values
hats
##     hatvalues(TCY)
## 2       0.21290798
## 3       0.19167812
## 5       0.07835187
## 6       0.07988768
## 7       0.10611081
## 8       0.09475865
## 9       0.04597844
## 10      0.04404193
## 11      0.07201387
## 12      0.07308525
## 15      0.13581008
## 19      0.07100794
## 20      0.06762549
## 21      0.04526655
## 22      0.37665168
## 23      0.09072711
## 24      0.08400364
## 29      0.04133848
## 30      0.06246831
## 35      0.34292671
## 36      0.08739606
## 37      0.04062334
## 38      0.18367009
## 43      0.07222597
## 44      0.39530021
## 45      0.06844399
## 46      0.04191017
## 51      0.04291890
## 52      0.08724704
## 53      0.04695298
## 54      0.06736653
## 55      0.06529267
## 62      0.04571295
## 63      0.04351669
## 64      0.97998610
## 67      0.04362519
## 68      0.07036614
## 77      0.09122794
## 78      0.06180910
## 88      0.34836347
## 90      0.07183608
## 91      0.04735230
## 101     0.07554022
## 102     0.37576933
## 103     0.10478660
## 104     0.05068387
## 105     0.13679124
## 106     0.07345583
## 107     0.13162536
## 108     0.05073110
## 117     0.06632617
## 118     0.12817979
## 119     0.09669277
## 120     0.08084825
## 121     0.07821484
## 122     0.09649332
## 123     0.07595300
## 124     0.07221547
## 127     0.05541486
## 128     0.09904922
## 129     0.06116892
## 130     0.07186645
## 131     0.06992124
## 136     0.06570245
## 137     0.11032187
## 1       0.25670894
## 4       0.19575441

Note that the row labels carry the index of the individual papers; we set this when importing the data (here numerically; in practice I suggest labelling by first author and year of publication, e.g. Novak2011, which is easier to search and remember).
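
A hedged sketch of that labelling convention (the CSV used here has no author column, so FirstAuthor is a hypothetical field):

# hypothetical: if the metadata contained a FirstAuthor column,
# author-year labels could be built like this:
# rownames(DF) <- paste0(DF$FirstAuthor, DF$PY)   # e.g. "Novak2011"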

Let’s sort the results in descending order with hats[order(-hats[['hatvalues(TCY)']]), ].

hats[order(-hats[['hatvalues(TCY)']]), ]
##  [1] 0.97998610 0.39530021 0.37665168 0.37576933 0.34836347 0.34292671
##  [7] 0.25670894 0.21290798 0.19575441 0.19167812 0.18367009 0.13679124
## [13] 0.13581008 0.13162536 0.12817979 0.11032187 0.10611081 0.10478660
## [19] 0.09904922 0.09669277 0.09649332 0.09475865 0.09122794 0.09072711
## [25] 0.08739606 0.08724704 0.08400364 0.08084825 0.07988768 0.07835187
## [31] 0.07821484 0.07595300 0.07554022 0.07345583 0.07308525 0.07222597
## [37] 0.07221547 0.07201387 0.07186645 0.07183608 0.07100794 0.07036614
## [43] 0.06992124 0.06844399 0.06762549 0.06736653 0.06632617 0.06570245
## [49] 0.06529267 0.06246831 0.06180910 0.06116892 0.05541486 0.05073110
## [55] 0.05068387 0.04735230 0.04695298 0.04597844 0.04571295 0.04526655
## [61] 0.04404193 0.04362519 0.04351669 0.04291890 0.04191017 0.04133848
## [67] 0.04062334

The question that arises is: which of these results are statistically significant?

Let’s view the leverage values graphically.

plot(hatvalues(TCY), type = 'h')

Step 3 We will determine the sample of articles for further narrative description.

Our test statistic is

\[ h^* = \frac{2(k-1)}{n} \] where \(h^*\) is the cut-off value of the test statistic, \(k\) the number of estimated parameters, and \(n\) the number of observations.

In our case \(k = 8\) parameters (we include the intercept) and \(n = 67\) observations; hence the cut-off value is \(2 \times (8-1)/67 = 0.2089552\).

So all papers whose hat (leverage) value exceeds 0.2089552 are the papers we want for the descriptive review. Let’s see which ones they are:

cutoff<-2*(8-1)/67
z<-hats> cutoff
z
##     hatvalues(TCY)
## 2             TRUE
## 3            FALSE
## 5            FALSE
## 6            FALSE
## 7            FALSE
## 8            FALSE
## 9            FALSE
## 10           FALSE
## 11           FALSE
## 12           FALSE
## 15           FALSE
## 19           FALSE
## 20           FALSE
## 21           FALSE
## 22            TRUE
## 23           FALSE
## 24           FALSE
## 29           FALSE
## 30           FALSE
## 35            TRUE
## 36           FALSE
## 37           FALSE
## 38           FALSE
## 43           FALSE
## 44            TRUE
## 45           FALSE
## 46           FALSE
## 51           FALSE
## 52           FALSE
## 53           FALSE
## 54           FALSE
## 55           FALSE
## 62           FALSE
## 63           FALSE
## 64            TRUE
## 67           FALSE
## 68           FALSE
## 77           FALSE
## 78           FALSE
## 88            TRUE
## 90           FALSE
## 91           FALSE
## 101          FALSE
## 102           TRUE
## 103          FALSE
## 104          FALSE
## 105          FALSE
## 106          FALSE
## 107          FALSE
## 108          FALSE
## 117          FALSE
## 118          FALSE
## 119          FALSE
## 120          FALSE
## 121          FALSE
## 122          FALSE
## 123          FALSE
## 124          FALSE
## 127          FALSE
## 128          FALSE
## 129          FALSE
## 130          FALSE
## 131          FALSE
## 136          FALSE
## 137          FALSE
## 1             TRUE
## 4            FALSE
sum(z, na.rm=FALSE)
## [1] 8

In the row labels we have the indexes of the papers, while the second column tells us whether or not the paper is influential.
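
To pull out just the indexes of the selected papers, a one-line sketch:

# paper indexes whose leverage exceeds the cut-off
rownames(hats)[z]
## [1] "2"   "22"  "35"  "44"  "64"  "88"  "102" "1"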

Let’s see how it worked out in the original paper:

Source:Staszkiewicz (2019).

In the original we have 7 papers; our computation yields 8 papers. The sample is thus reduced roughly tenfold.

Step 4 We will evaluate the stability of the results and the characteristics of the sample.

It seems that this method selects the oldest item, the most recent item, and usually a review (here, Habib); the rest depends on how central a paper is to the mainstream discussion. Technically, verification was done against other literature reviews, but there is no formal test method.
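
One informal way to probe the stability (a sketch, not a formal test) is to jackknife the selection itself: drop one paper at a time, refit the model, reapply the cut-off rule, and count how many distinct selections appear.

# informal jackknife of the selection (a sketch, not a formal test)
selections <- sapply(seq_len(nrow(DF)), function(i) {
  m   <- update(TCY, data = DF[-i, ])           # refit without paper i
  h   <- hatvalues(m)
  cut <- 2 * (length(coef(m)) - 1) / length(h)  # same cut-off rule as above
  paste(sort(names(h)[h > cut]), collapse = ",")
})
length(unique(selections))  # few distinct selections => a stable result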

Step 5 We discuss the technical issues of importing data from WoS.

The whole method is based on bibliographic metadata. In practice the DOI is crucial, because it uniquely identifies a paper and its attributes, but it carries no citation counts. Therefore the bibliometric databases WoS and Scopus are used; unfortunately, the two have different algorithms for collecting citations, so they are inconsistent. You can do a review on WoS and check its stability with Scopus, or vice versa, although I do not recommend it - a lot of work for little gain. You can also export from Mendeley or Zotero and add citation counts manually; this is labor-intensive, but if you are writing the review in Word it may be worth the time investment. For markup-based workflows I recommend Zotero, because it integrates more smoothly.

Back to the bibliometric databases. Access is generally possible after registration; it is advisable to export the full record (including the abstract), as this helps in coding the variables.
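
As a hedged illustration of turning a full-record export into the dependent variable (TC and PY follow the WoS field-tag conventions, savedrecs.txt is a typical export file name, and 2015 is this dataset’s census year; your export may differ):

# sketch: derive the yearly citation rate from a WoS tab-delimited export
library(readr)
raw <- read_tsv("savedrecs.txt", show_col_types = FALSE)   # hypothetical export file
raw$PublicationYears <- 2015 - raw$PY                      # years since publication
raw$TCYear <- raw$TC / pmax(raw$PublicationYears, 1)       # citations per year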

An example of how to do this

Importing from WoS from SGH Library https://www.youtube.com/watch?v=8wVqVvEsYUc&t=2s.

Loading CSV into Excel https://www.youtube.com/watch?v=pY895-S1VWE&t=1s.

Of course, the whole procedure can be done faster in Gretl https://www.youtube.com/watch?v=XIFPuDmrAmA.

However, I do not recommend it, because it is a high-level tool.

Application

In practice, CCR is used either to reduce the population of papers for a descriptive (narrative) review, as a supplement to systematic review methods, or - more from the reviewer’s position - to check whether a submitted review article is complete and whether significant researcher bias is present.

Literature

Liberati, Alessandro, Douglas G. Altman, Jennifer Tetzlaff, Cynthia Mulrow, Peter C. Gøtzsche, John P. A. Ioannidis, Mike Clarke, P. J. Devereaux, Jos Kleijnen, and David Moher. 2009. “The PRISMA Statement for Reporting Systematic Reviews and Meta-Analyses of Studies That Evaluate Health Care Interventions: Explanation and Elaboration.” Journal of Clinical Epidemiology 62 (10): e1–34. https://doi.org/10.1016/j.jclinepi.2009.06.006.
Staszkiewicz, Piotr. 2019. “The Application of Citation Count Regression to Identify Important Papers in the Literature on Non-Audit Fees.” Managerial Auditing Journal 34 (1): 96–115. https://doi.org/10.1108/maj-05-2017-1552.