Updating standards for reporting diagnostic accuracy: the development of STARD 2015

Background Although the number of reporting guidelines has grown rapidly, few have gone through an updating process. The STARD statement (Standards for Reporting Diagnostic Accuracy), published in 2003 to help improve the transparency and completeness of reporting of diagnostic accuracy studies, was recently updated in a systematic way. Here, we describe the steps taken and a justification for the changes made.

Results A 4-member Project Team coordinated the updating process; a 14-member Steering Committee was regularly solicited by the Project Team when making critical decisions. First, a review of the literature was performed to identify topics and items potentially relevant to the STARD updating process. After this, the 85 members of the STARD Group were invited to participate in two online surveys to identify items that needed to be modified, removed from, or added to the STARD checklist. Based on the results of the literature review process, 33 items were presented to the STARD Group in the online survey: 25 original items and 8 new items; 73 STARD Group members (86%) completed the first survey, and 79 STARD Group members (93%) completed the second survey. Then, an in-person consensus meeting was organized among the members of the Project Team and Steering Committee to develop a consensual draft version of STARD 2015. This version was piloted in three rounds among a total of 32 expert and non-expert users. Piloting mostly led to rewording of items. After this, the update was finalized. The updated STARD 2015 list now consists of 30 items. Compared to the previous version of STARD, three original items were each converted into two new items, four original items were incorporated into other items, and seven new items were added.

Conclusions After a systematic updating process, STARD 2015 provides an updated list of 30 essential items for reporting diagnostic accuracy studies.
Electronic supplementary material The online version of this article (doi:10.1186/s41073-016-0014-7) contains supplementary material, which is available to authorized users.

There is widespread variability in the interpretation of the labels "prospective" and "retrospective".
It is relevant to know in which order question formulation, data collection and analysis took place.

Should we:
Item 7: Describe the reference standard and its rationale.

Consideration:
The rationale for the reference standard is often not reported, and typically not provided in the methods section.

Should we:
Item 8: Describe technical specifications of material and methods involved including how and when measurements were taken, and/or cite references for index tests and reference standard. The relevance of technical information reported may differ between types of tests (e.g. imaging, laboratory, other).

Should we:
Keep this item as it is
Modify this item: refer to list of preferred descriptions for specific test types (to be developed) (our suggestion)

Item 9: Describe definition of and rationale for the units, cutoffs and/or categories of the results of the index tests and the reference standard.

Consideration:
This item is ambiguous: accuracy does not depend on the unit of measurement, but may change with the cutoffs and categories chosen to classify test results.

Should we:
Keep this item as it is
Modify this item: remove "units" and invite authors to report whether cutoffs and/or categories were prespecified (our suggestion)

Item 10: Describe the number, training and expertise of the persons executing and reading the index tests and the reference standard.

Should we:
Item 11: Describe whether or not the readers of the index tests and reference standard were blind (masked) to the results of the other test and describe any other clinical information available to the readers.

Consideration:
There is widespread variability in the interpretation of the label "blind".
It is important to know what information is available to the readers of the tests. This item contains both a negative statement ("blinding") and a positive statement ("clinical information available").

Should we:
Item 12: Describe methods for calculating or comparing measures of diagnostic accuracy, and the statistical methods used to quantify uncertainty (e.g. 95% confidence intervals).

Consideration:
The nature of statistical methods to be reported seems unclear to many authors: methods for the accuracy statistics, or for the uncertainty, or both?
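To illustrate the distinction raised in this consideration, the following minimal sketch separates the two kinds of method: one function computes the accuracy statistics (sensitivity and specificity), another quantifies their uncertainty. The Wilson score interval is used here only as one common choice and is an assumption, not a method prescribed by STARD; the function names and the 2x2 counts are hypothetical.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half_width = (
        z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    )
    return centre - half_width, centre + half_width


def accuracy_with_uncertainty(tp, fp, fn, tn):
    """Return point estimates and intervals for sensitivity and specificity."""
    return {
        "sensitivity": (tp / (tp + fn), wilson_interval(tp, tp + fn)),
        "specificity": (tn / (tn + fp), wilson_interval(tn, tn + fp)),
    }


# Hypothetical 2x2 counts, for illustration only.
result = accuracy_with_uncertainty(tp=90, fp=20, fn=10, tn=180)
for name, (estimate, (low, high)) in result.items():
    print(f"{name}: {estimate:.3f} (95% CI {low:.3f} to {high:.3f})")
```

A methods section following this sketch would name both parts: how the accuracy estimates were calculated and which interval method was used.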

Should we:
Item 13: Describe methods for calculating test reproducibility, if done.

Consideration:
The word "reproducibility" is ambiguous. Estimating a test's reproducibility is not an element of most diagnostic accuracy studies.
Many studies refer to other publications or to the manufacturer for information on test reproducibility.

Should we:
Item 14: Report when study was done, including beginning and ending dates of recruitment.

Consideration:
This item is almost always reported in the methods section, rarely in the results section, and refers to participant recruitment (item 4).

Should we:
Item 15: Report clinical and demographic characteristics of the study population (e.g. age, sex, spectrum of presenting symptoms, comorbidity, current treatments, recruitment centers).

Consideration:
Depending on the type of test and target condition, there is a very large variety in suitable clinical and demographic characteristics reported in diagnostic accuracy studies.

Should we:
Item 16: Report the number of participants satisfying the criteria for inclusion that did or did not undergo the index tests and/or the reference standard; describe why participants failed to receive either test (a flow diagram is strongly recommended).

Consideration:
This is a lengthy and complex item.
Flow diagrams were strongly recommended in STARD, but these are only used in a minority of studies.

Should we:
Item 17: Report time interval from the index tests to the reference standard, and any treatment administered between.

Consideration:
We did not observe major issues with this item.

Should we:
Item 18: Report distribution of severity of disease (define criteria) in those with the target condition; other diagnoses in participants without the target condition.

Should we:
Item 19: Report a cross tabulation of the results of the index tests (including indeterminate and missing results) by the results of the reference standard; for continuous results, the distribution of the test results by the results of the reference standard.

Consideration:
Long and confusing wording. Indeterminate and missing results are almost never reported in cross tabulations.
It is important to know how indeterminate results, missing responses and outliers were handled, but this is already discussed in item 22.
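As a rough illustration of what Item 19 asks for, the sketch below builds a cross tabulation that keeps indeterminate and missing index test results as separate rows instead of dropping them. The category labels, function names and counts are hypothetical and serve only as an example.

```python
from collections import Counter

INDEX_LEVELS = ("positive", "negative", "indeterminate", "missing")
REFERENCE_LEVELS = ("present", "absent")  # target condition per reference standard


def cross_tabulate(pairs):
    """Count (index test result, reference standard result) pairs."""
    counts = Counter(pairs)
    return {
        index: {ref: counts[(index, ref)] for ref in REFERENCE_LEVELS}
        for index in INDEX_LEVELS
    }


def format_table(table):
    """Render the cross tabulation as aligned plain text."""
    header = f"{'index test':<15}" + "".join(f"{ref:>10}" for ref in REFERENCE_LEVELS)
    rows = [header]
    for index in INDEX_LEVELS:
        cells = "".join(f"{table[index][ref]:>10}" for ref in REFERENCE_LEVELS)
        rows.append(f"{index:<15}" + cells)
    return "\n".join(rows)


# Hypothetical participant-level results, for illustration only.
pairs = (
    [("positive", "present")] * 40
    + [("negative", "present")] * 8
    + [("indeterminate", "present")] * 2
    + [("positive", "absent")] * 5
    + [("negative", "absent")] * 90
    + [("indeterminate", "absent")] * 4
    + [("missing", "absent")] * 1
)
table = cross_tabulate(pairs)
print(format_table(table))
```

Keeping the extra rows makes it visible how many participants would be excluded from a conventional 2x2 table.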

Should we:
Item 20: Report any adverse events from performing the index tests or the reference standard.

Consideration:
Adverse events are rarely reported in diagnostic accuracy studies; such studies typically lack the power and design to estimate adverse event rates.
Many tests do not have intrinsic adverse events.

Should we:
Item 21: Report estimates of diagnostic accuracy and measures of statistical uncertainty (e.g. 95% confidence intervals).

Consideration:
We did not observe major issues with this item.

Authors should be encouraged to plan ahead how to handle indeterminate results, missing responses and outliers in their study protocol.

Should we:
Item 23: Report estimates of variability of diagnostic accuracy between subgroups of participants, readers or centers, if done.

Consideration:
Test accuracy may vary across subgroups but many diagnostic accuracy studies lack the power to detect such variations.
Variability is often not reported.
Multiple subgroup analyses can increase the risk of false-positive findings.
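The last point can be made concrete with a short calculation. The sketch below assumes that each subgroup comparison is an independent test at the same significance level and that no true differences exist; under those simplifying assumptions, the chance of at least one false-positive finding is 1 - (1 - alpha)^k for k comparisons. The function names are hypothetical, and the Bonferroni adjustment is shown only as one familiar option.

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """Probability of at least one false positive among independent tests."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_alpha(n_tests, alpha=0.05):
    """Per-test significance level that keeps the overall rate at or below alpha."""
    return alpha / n_tests


for k in (1, 5, 10, 20):
    print(
        f"{k:>2} subgroup analyses: "
        f"risk of any false positive = {familywise_error_rate(k):.3f}, "
        f"Bonferroni per-test alpha = {bonferroni_alpha(k):.4f}"
    )
```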

Should we:
Item 24: Report estimates of test reproducibility, if done.

Consideration:
The word "reproducibility" is ambiguous.
Estimating a test's reproducibility is not an element of most diagnostic accuracy studies.
Many studies refer to other publications or to the manufacturer for information on test reproducibility.

Should we:
Item 25: Discuss the clinical applicability of the study findings.

Consideration:
This item is rather vague, general and not specific for diagnostic accuracy studies.
Many reports of test accuracy studies offer generous and optimistic interpretations of the study findings, with strong recommendations for practice.

The following items and issues were identified based on our literature review and comparisons between STARD and other reporting guidelines.
Your comments are always welcome and will be taken into account in the preparations for the second round of the survey. This survey is not anonymous.
Please enter your first name and last name.

Many diagnostic accuracy studies report the area under the receiver operating characteristic curve (AUC ROC).
Without accuracy estimates at specific cutoffs, such a result is difficult to apply.
Should STARD recommend reporting at least one cutoff when reporting AUC ROC?
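The following sketch illustrates why an AUC ROC alone can be hard to apply: the same set of test results yields a single AUC but very different sensitivity and specificity depending on the cutoff. It assumes that higher scores indicate the target condition and that results at or above the cutoff count as positive; the function names and data are hypothetical.

```python
def auc_roc(diseased, non_diseased):
    """AUC as the probability that a diseased score exceeds a non-diseased one.

    Ties count as one half (the Mann-Whitney formulation).
    """
    wins = 0.0
    for d in diseased:
        for n in non_diseased:
            if d > n:
                wins += 1.0
            elif d == n:
                wins += 0.5
    return wins / (len(diseased) * len(non_diseased))


def accuracy_at_cutoff(diseased, non_diseased, cutoff):
    """Sensitivity and specificity when scores >= cutoff are called positive."""
    sensitivity = sum(score >= cutoff for score in diseased) / len(diseased)
    specificity = sum(score < cutoff for score in non_diseased) / len(non_diseased)
    return sensitivity, specificity


# Hypothetical index test results, for illustration only.
diseased = [3.1, 4.2, 5.0, 6.3, 7.8]
non_diseased = [1.0, 2.2, 2.9, 3.5, 4.8]

print(f"AUC ROC: {auc_roc(diseased, non_diseased):.2f}")
for cutoff in (3.0, 5.0):
    sens, spec = accuracy_at_cutoff(diseased, non_diseased, cutoff)
    print(f"cutoff {cutoff}: sensitivity {sens:.2f}, specificity {spec:.2f}")
```

With these example values the AUC stays fixed while the two cutoffs trade sensitivity against specificity, which is the information a reader needs to apply the test.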

Additional information
Scope of STARD

PROPOSALS FOR NEW ITEMS
Consideration:
There is a movement towards more openness and transparency in health research in general. This is not specific for test accuracy studies.

Many other studies also report accuracy estimates, as an additional aim.
Should the applicability of STARD be rephrased, from "diagnostic accuracy studies" to "studies reporting diagnostic accuracy"?

Consideration:
STARD was originally targeted at "diagnostic accuracy studies", which are cross-sectional.
In practice, we also see diagnostic studies with so-called "delayed verification", and other studies reporting on prognostic accuracy.
Should the applicability of STARD be extended to prognostic accuracy studies?

Medical tests are not just used for diagnosis and prognosis, but also for other purposes, such as screening, monitoring or treatment selection.
Many, if not all, STARD items also apply to studies evaluating the accuracy of such tests.
Should the applicability of STARD be rephrased, from "diagnostic accuracy" to "(clinical) test accuracy"?

Consideration:
The emphasis in STARD was on studies of a single (index) test, but the principles also apply to evaluations of the accuracy of multiple tests, combinations of tests, and multivariable models and rules.
Should the applicability of STARD be rephrased, e.g. in terms of "all evaluations of the accuracy of one or more tests, or combinations of test results and/or other variables"?

Consideration:
There is a wide variety of terms used to describe elements of a diagnostic accuracy study.
Should STARD recommend preferred terms for indicating...
...the type of study (e.g. "a diagnostic accuracy study" or "a test accuracy study")? Yes / No / No opinion
...the study design (e.g. cohort/case-control or single-gate/multiple-gate studies)? Yes / No / No opinion
...the "index test" and the "clinical reference standard"? Yes / No / No opinion
Open comment box: