This study yielded three important findings: (1) SANRA can be applied to manuscripts in everyday editorial work. (2) SANRA’s internal consistency and item-total correlations are sufficient. (3) SANRA’s inter-rater reliability is satisfactory.
Feasibility
It is our experience with the current and earlier versions of SANRA that editors, once accustomed to the scale, can integrate it into their everyday routine. It is important, however, to learn how to fill out SANRA. To this end, we provide definitions and examples in the accompanying explanations and instructions document, and we recommend that new users practice filling out SANRA with this resource. Editorial teams, or teams of scientists and/or clinicians, may prefer to learn to use SANRA in group sessions.
Consistency and homogeneity
With a Cronbach’s alpha of 0.68 and corrected item-total correlations between 0.33 and 0.58, we consider the scale’s internal consistency and item homogeneity sufficient for widespread application. It should be noted that because coefficient alpha increases with the number of items [12], simplifying a scale by reducing the number of items, as we did, may decrease internal consistency. However, this needs to be balanced against the practical need for brevity. In fact, the earlier seven-item versions of SANRA had higher alpha values: 0.80 and 0.84, respectively [9]. Still, the number of items is not necessarily the only explanation for the differences in alpha: for example, the manuscripts included in the two earlier studies may have been easier to rate.
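For readers who wish to check these coefficients against their own ratings, the sketch below shows one standard way to compute Cronbach’s alpha and corrected item-total correlations in Python; the item-score matrix is simulated purely for illustration and is not the study data.

```python
import numpy as np

# Simulated ratings: 30 manuscripts x 6 SANRA items, each scored 0-2
# (illustrative numbers only, not the study data).
rng = np.random.default_rng(0)
scores = rng.integers(0, 3, size=(30, 6)).astype(float)

# Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of the sum score)
k = scores.shape[1]
item_vars = scores.var(axis=0, ddof=1)
total_var = scores.sum(axis=1).var(ddof=1)
alpha = k / (k - 1) * (1 - item_vars.sum() / total_var)

# Corrected item-total correlation: each item against the sum of the remaining items
total = scores.sum(axis=1)
corrected_r = [np.corrcoef(scores[:, i], total - scores[:, i])[0, 1] for i in range(k)]

print(round(alpha, 2), [round(r, 2) for r in corrected_r])
```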
Inter-rater reliability
The scale’s intra-class correlation (0.77, compared with 0.76 in [9]) indicates that SANRA can be used reliably by different raters, an important property of a scale that may be applied in manuscript preparation and review, in editorial decision-making, or even in research on narrative reviews. Like internal consistency, reliability increases with the number of items [12], and there is a trade-off between simplicity (e.g., a small number of items) and reliability. While the ICC suggests sufficient reliability, the lower confidence limit (0.57) does not preclude a level of reliability normally deemed unacceptable in most applications of critical appraisal tools. This finding underscores the importance of rater training. Raters disagreed most often on items 1 and 4. After the study, we therefore slightly edited these items, along with items 5 and 6, which we revised for clarity. In the same vein, we revised our explanations and instructions document.
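As an illustration of how such a coefficient can be computed, the following sketch derives a two-way, average-measures intra-class correlation (Shrout and Fleiss ICC(2,k)) from a simulated manuscripts-by-raters matrix. The choice of this particular ICC model, the three raters, and all numbers are assumptions made for the example, not the study’s data or necessarily its exact model.

```python
import numpy as np

# Simulated sum scores: 30 manuscripts rated by 3 raters (illustrative only).
rng = np.random.default_rng(1)
true_quality = rng.normal(7, 2, size=(30, 1))
ratings = true_quality + rng.normal(0, 1, size=(30, 3))

n, k = ratings.shape
grand = ratings.mean()
row_means = ratings.mean(axis=1, keepdims=True)  # per-manuscript means
col_means = ratings.mean(axis=0, keepdims=True)  # per-rater means

# Two-way ANOVA mean squares
ms_rows = k * ((row_means - grand) ** 2).sum() / (n - 1)
ms_cols = n * ((col_means - grand) ** 2).sum() / (k - 1)
ms_err = ((ratings - row_means - col_means + grand) ** 2).sum() / ((n - 1) * (k - 1))

# Average-measures, two-way random-effects ICC: (MSR - MSE) / (MSR + (MSC - MSE)/n)
icc_2k = (ms_rows - ms_err) / (ms_rows + (ms_cols - ms_err) / n)
print(round(icc_2k, 2))
```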
It is important to bear in mind that testing of a scale always relates only to the setting of a given study. Strictly speaking, the results presented here are therefore not a general feature of SANRA but of SANRA as filled out by particular raters for a particular sample of manuscripts. From our experience, however, we trust that our setting is similar to that of many journals and that our sample of manuscripts represents an average group of papers. As a consequence, we are confident that SANRA can be applied by other editors, reviewers, readers, and authors.
Validity
In a post hoc analysis, we found a modest but statistically significant correlation of SANRA sum scores with manuscript acceptance. We interpret this as a sign of criterion validity, but emphasize that it is both a post hoc result and only a weak correlation. The weakness of the correlation, however, points to the fact that, at the level of submitted papers, aspects other than quality alone influence editorial decision-making: for example, whether the topic has been covered in the journal recently, or whether editors believe that authors or topics have potential even when initial SANRA scores are low. SANRA will therefore often be used as one decision aid among others, not the only one. Also, decisions to accept were made only after the papers had been revised.
Moreover, additional results on criterion validity are needed, as are results on SANRA’s construct validity. SANRA’s content validity, defined as a scale’s ability to cover all aspects of a construct, will on the other hand be restricted, because we decided to limit the scale to six items, too few to encompass all facets of review article quality; SANRA is a critical appraisal tool, not a reporting guideline. For example, we deleted an item on the accessibility of the manuscript. Other possible domains that are not part of SANRA include the originality of the manuscript and the quality of tables and figures. These features are important, but we believe the six items forming SANRA are a core set that sufficiently indicates the quality of a review manuscript while remaining short enough to be applied without undue time and effort. SANRA’s brevity also contrasts with other appraisal tools, such as AMSTAR 2 for systematic reviews (16 items) or, to a lesser extent, CASP for RCTs (11 items).
Throughout this paper, we have referred to the current version of SANRA as the revision of earlier forms. This is technically true. However, because it is normal for scales to go through several versions before publication, and because this paper is the first widespread publication of SANRA, we propose to call the present version simply SANRA.
While medicine has achieved a great deal in formalizing and improving the presentation of randomized trials, systematic reviews, and a number of other text types, much less work has been done on the most frequent form of medical publication, the narrative review. There are exceptions: Gasparyan et al. [13], for example, have provided guidance for writing narrative reviews, and Byrne [14] as well as Pautasso [15] have written, from different angles, thoughtful editorials on improving narrative reviews and presented lists of key features of a good review, lists that naturally overlap with SANRA items (e.g., on referencing). These lists, however, are not tested scales and are not intended for comparing different manuscripts. SANRA can be used to compare manuscripts the way we used it in our editorial office, that is, within one setting. At present, however, it seems unwise to compare manuscripts across different settings because there are, so far, no established cut-offs for different grades of quality (e.g., poor, fair, moderate, good, very good). Still, in our experience, a score of 4 or below indicates very poor quality.
Limitations
The main limitation of this study is its sample size. While, in our experience, a sample of 30 is not unusual in scale testing, it represents a compromise between representativeness for our journal, adequate power, and feasibility; it took us about 6 months to sample 30 consecutive narrative reviews. Moreover, the authors of the scale were also the raters in this study, and it is possible that inter-rater reliability is lower in groups less familiar with the scale. As for most scales, this underscores the importance of using the instructions that belong to the scale, in the present case the explanations and instructions document. It is also advisable to practice with the scale before applying SANRA to rate manuscripts. In addition, by design, this is not a study of test-retest reliability, another important feature of a scale. Finally, as previously acknowledged, although we believe our setting is representative of medical journals, the present results refer to the setting of this study, and consistency and reliability measures are study-specific.