A randomised controlled trial of an Intervention to Improve Compliance with the ARRIVE guidelines (IICARus)

Background The ARRIVE (Animal Research: Reporting of In Vivo Experiments) guidelines are widely endorsed but compliance is limited. We sought to determine whether journal-requested completion of an ARRIVE checklist improves full compliance with the guidelines. Methods In a randomised controlled trial, manuscripts reporting in vivo animal research submitted to PLOS ONE (March–June 2015) were randomly allocated to either requested completion of an ARRIVE checklist or current standard practice. Authors, academic editors, and peer reviewers were blinded to group allocation. Trained reviewers performed outcome adjudication in duplicate by assessing manuscripts against an operationalised version of the ARRIVE guidelines that consists 108 items. Our primary outcome was the between-group differences in the proportion of manuscripts meeting all ARRIVE guideline checklist subitems. Results We randomised 1689 manuscripts (control: n = 844, intervention: n = 845), of which 1269 were sent for peer review and 762 (control: n = 340; intervention: n = 332) accepted for publication. No manuscript in either group achieved full compliance with the ARRIVE checklist. Details of animal husbandry (ARRIVE subitem 9b) was the only subitem to show improvements in reporting, with the proportion of compliant manuscripts rising from 52.1 to 74.1% (X2 = 34.0, df = 1, p = 2.1 × 10−7) in the control and intervention groups, respectively. Conclusions These results suggest that altering the editorial process to include requests for a completed ARRIVE checklist is not enough to improve compliance with the ARRIVE guidelines. Other approaches, such as more stringent editorial policies or a targeted approach on key quality items, may promote improvements in reporting. Electronic supplementary material The online version of this article (10.1186/s41073-019-0069-3) contains supplementary material, which is available to authorized users.


Background
There are widespread failures across in vivo animal research to adequately describe and report research methods, including critical measures to reduce the risk of experimental bias [10,13]. Such omissions have been shown to be associated with overestimation of effect sizes [8,13] and are likely to contribute, in part, to translational failure. In an effort to improve reporting standards, an expert working group coordinated by the National Centre for the Replacement, Refinement and Reduction of Animals in Research (NC3Rs) developed the Animal Research: Reporting of In Vivo Experiments (ARRIVE) guidelines [9], published in 2010.
Since the ARRIVE guidelines were first published, they have been endorsed by many journals in their instructions to authors, but this has not been accompanied by substantial improvements in reporting [2,3,6,14]. Simply endorsing the guidelines does not appear to be sufficient to encourage compliance. Recent findings suggest that following the introduction of mandated completion of a distinct reporting checklist at ten Nature Journals at the stage of first revision significantly improved the quality in reporting versus that of comparator journals [7,12].
PLOS ONE is an open access online only journal which at the time this study began published around 32,000 research articles per year. Of these, some 5000 were described in vivo animal research. At present, PLOS ONE instructions to authors encourage compliance with the ARRIVE guidelines, but do not mandate checklist completion. Journals have an important role to play in ensuring that the quality of reporting in the research they publish is robust, yet the most effective mechanism by which they can achieve this remains unclear.
Our aim was to assess the effect of an email request to authors to complete an ARRIVE checklist on compliance with the ARRIVE guidelines. The email request was at the time of submission and requested that authors state where in the manuscript various components of the ARRIVE guidelines are reported.

Methodology and open data
Our protocol, data analysis plan, analysis code, data validation code, and complete dataset are available on the Open Science Framework (https://doi.org/10.17605/OSF. IO/XSJBV).

Ethical approval
We sought an informal ethical opinion from the BMJ Ethics Committee, who were prepared to consider our proposal although it was slightly out of scope. We did this because we were unable at the time to identify an institutional ethics committee who considered this research to fall within their remit. The majority view of the committee was that it was ethical for manuscripts to be randomised between different handling methods; that it was ethical for authors, peer reviewers, and academic editors to be kept unaware of the existence of the study while it was in progress; and that it was ethical for the study to receive funding from the NC3Rs.
The PLOS ONE editorial process involves an initial screening process, including a determination of whether a manuscript describes animal studies, whether it describes human studies (one manuscript might describe both), and categorises the area of research according to an established taxonomy. For studies reporting the use of animals, checks are carried out to ensure that appropriate institutional animal care and use committee/ethical approvals were in place, and authors of studies perceived to be at high risk-for instance those animal studies which used death as an endpoint-are contacted to provide a valid justification. Manuscripts are then allocated to an academic editor (AE), who assigns peer reviewers as appropriate.
Manuscripts submitted to PLOS ONE between March and June 2015 describing in vivo animal research were randomised using the IICARus web platform to receive standard editorial processing (control group) or checklist completion requests (intervention group). The randomisation sequence was generated using randomisation in the C# programming language and involved minimisation (weighted at 0.75) to ensure that country of origin (of the corresponding author) was balanced between groups. All users of the platform were assigned a level of access in line with their role (e.g. "Trainee", "Reviewer", "Randomiser") to ensure that any individual involved in outcome assessment was blinded to the allocation sequence.
On submission, authors receive an automated acknowledgement from the publisher that their submission had entered a screening phase. For manuscripts identified during screening to include in vivo research and which were randomised to the intervention, corresponding authors were informed in the post screening email that a completed ARRIVE checklist must be completed before the manuscript could advance through the review process. The email advised that this should include details of the page of their manuscript on which each ARRIVE item was addressed (an excerpt from the email used for the intervention is available on the OSF https:// doi.org/10.17605/OSF.IO/XSJBV). If the PLOS editorial team did not receive a checklist, it was sent back to authors once more for completion. Manuscripts by authors who did not complete the checklist after the second contact, for any reason, were still passed to the next stage and continued in the study. The contents of completed checklists were not checked against the manuscript for compliance at any stage.

Blinded manuscript processing
Authors, AEs, and peer reviewers were blinded to the existence of the study. Study personnel took care, in their public comments, not to disclose details of the study or the journal at which the study was being conducted. The journal was not named in the study protocol. If authors enquired as to why their manuscript was being processed differently, they were to be advised that these differences were due to variation within the editorial team in the intensity with which they pursued efforts to improve the review process.
For studies randomised to the control group, PLOS ONE processed the manuscript according to their normal editorial processes.
Once a final decision regarding publication was made, the pre-publication materials for accepted manuscripts were collated by the PLOS editorial team. Where an ARRIVE checklist was included in the accepted materials for publication, this was redacted, along with any reference in the text to the submission of a completed ARRIVE checklist. The format of manuscripts largely excluded any evidence that the manuscript was submitted to PLOS ONE. If a reference to PLOS ONE was discovered in the text by internal outcome assessors (within our research group), this was also redacted to prevent any change in behaviour which may result from external outcome assessors knowing which publisher was involved in the study. Where authors stated that the work complies with the ARRIVE guidelines this statement was not redacted. Redacted PDFs of all materials were provided to our research team and uploaded to the IICARus web platform.

Outcome assessment
Primary, secondary, tertiary, and feasibility outcomes are confirmatory and were pre-specified in our study protocol (https://doi.org/10.17605/OSF.IO/XSJBV). The unit of analysis for all outcomes was the manuscript.
Our primary outcome was to assess whether the proportion of publications in each group considered to fully comply with all of the ARRIVE criteria (at the level of the 38 subitems) was independent of group allocation.
To assess compliance with the ARRIVE guidelines in greater detail, we assessed reporting at the subitem level. The Landis core reporting standards [11] set out the key aspects of study design and conduct necessary to improve transparency of in vivo research and allow readers to ascertain the validity of the findings reported. We therefore wanted to assess compliance with the ARRIVE items which form these criteria, namely, reporting of randomisation to experimental groups (subitem 6b), reporting of blinded outcome assessment (subitem 6b), reporting of a sample size calculation (subitem 10a), and reporting of animal exclusions (subitem 15b). In addition, we aimed to investigate any effects our intervention may have on the process of publication by assessing the number of manuscripts accepted for publication in each group.
Our secondary outcomes were therefore to assess whether: The proportion of publications meeting each of the individual 38 ARRIVE subitems was independent of group allocation (intervention/ control) The proportion of studies reporting all Landis criteria subitems present in the ARRIVE guidelines was independent of group allocation The proportion of submitted manuscripts accepted for publication was independent of group allocation We also wanted to examine whether different domains, countries, or research that includes human research may be associated with differences in adherence to the ARRIVE guidelines.
For our tertiary outcomes, we assessed whether: The proportion of publications meeting each of the 38 ARRIVE subitems was independent of group allocation, stratified by experimental animal The proportion of studies reporting all of the Landis criteria subitems (blinded assessment of outcome, sample size calculation, and criteria for exclusion of experimental subjects), stratified by experimental animal, was independent of group allocation The proportion of publications meeting each of the 38 ARRIVE subitems was independent of group allocation, stratified by the country of the address of the corresponding author The proportion of publications meeting each of the 38 ARRIVE subitems was independent of group allocation, stratified by whether or not the research also contains human data To examine the feasibility of implementing requests for ARRIVE checklist completion at PLOS ONE, we assessed outcomes relating to the duration of processing manuscripts to gain an insight into potential costs to the journal. We therefore assessed the following for accepted manuscripts in each group: Time (days) spent in PLOS editorial office in handling the manuscript (prior to editor assignment). Time (days) from manuscript submission to AE assignment. Time (days) from AE assignment to first reviewer agreed. Time (days) from AE assignment to first decision Time (days) from receipt of last review to AE decision In addition, we assessed the following outcomes in manuscripts that were accepted following resubmission: Time (days) from initial decision letter to resubmission Number of cycles of resubmission Time (days) from resubmission to final decision We also conducted some exploratory analyses not defined in the study protocol. Although the majority of authors complied with the request to complete an ARRIVE checklist, a small number did not. By limiting our analyses to the manuscripts that had received the intervention (equivalent to an "on treatment" analysis), we assessed: The proportion of publications meeting each of the 38 ARRIVE subitems in the control and "on treatment" intervention group The Landis reporting criteria are minimal reporting standards which are important to quality improvement efforts. Given that the reporting of randomisation to experimental groups and blinded outcome assessment are part of the same subitem (6b) in the ARRIVE guidelines, we explored each of the Landis criteria items individually to assess: The proportion of publications meeting each individual Landis criteria in each group We divided the 20 main ARRIVE items into 38 subitems. These subitems were further operationalised 108 questions (Additional file 1) which were scored by trained outcome assessors on the web platform. Two of these questions (0.1.0 What animal species are used in this research? and 0.2.0: Does the manuscript include human study?) were used only to categorise research for analysis purposes (Tertiary outcomes) and were not included in assessments of compliance. It was later determined by the steering committee that six questions from the original 108 were not strictly required to comply with the ARRIVE checklist (see Additional file 1; removed questions highlighted in grey) and were therefore also excluded from the analysis. The decision to remove data from these six questions was taken after data collection, but prior to the unblinding of group allocation.
PDF files of manuscripts were available alongside the scoring questions. Each manuscript was scored by two independent reviewers who were blinded to both intervention status and to the score given by the alternative reviewer. Manuscripts were presented to reviewers in random order, and the platform did not allow the same user to review the same manuscript twice. Discrepancies between reviewers were reconciled by a third independent reviewer, who could view both previous scores. For some manuscripts, some questions were not applicable (e.g. questions relating to fish studies in a manuscript describing rat experiments). Compliance was only assessed against questions that were applicable to the study.
There were several deviations from the outcome measures specified our study protocol. The time spent in the PLOS editorial office was not disentangled from time with the authors; therefore, it includes time for the authors to follow any copyediting changes and requests for documents (including the request to complete an ARRIVE checklist). Similarly, the time spent with authors was also included in the time from manuscript submission to AE assignment. In addition, we had originally intended to analyse the time in the PLOS editorial office in minutes, but the measurement of this was not feasible. We were unable to analyse "The proportion of submitted manuscripts accepted for publication, stratified by experimental animal" (Secondary outcome measure) as we did not receive species categorisation data for studies which were not accepted. PLOS ONE were unable to provide us with one of our specified feasibility measures "Time (days) for each reviewer, from solicitation of reviews to receipt of reviews)" and instead provided "Time (days) from AE assignment to first decision". "Time (days) from AE assignment to first reviewer agreed" also differs to the outcome described in our study protocol ("Time (days) from AE assignment to solicitation of reviews"). In addition, we had originally set out in our protocol that we would look at the following feasibility outcome measures for manuscripts when the decision was other than "Accept" or "Reject": whether a revised manuscript is submitted, time (days) from initial decision letter to resubmission, number of cycles of resubmission, and time (days) from resubmission to final decision. However, we did not attain this information for manuscripts which were not eventually accepted. Therefore, all feasibility measures apply to accepted manuscripts only.

Reviewer training
This was a challenging project, and we used crowdsourcing to recruit additional reviewers external to our research group. We used our research networks and social media to identify researchers and students across the biomedical sciences and recruit them as outcome assessors for the project. As an incentive, rewards were given to external reviewers who reached a pre-specified number of manuscript reviews or completed the most reviews in a certain time period.
To ensure that review quality was high, we required reviewers to complete online training prior to reviewing manuscripts as part of the project. We developed a training program with a pool of 10 manuscripts for which we described "Gold standard" correct answers with explanations and an accompanying document with further elaboration (https://doi.org/10.17605/OSF.IO/XSJBV). To successfully complete the training, external reviewers had to score 80% against these gold standard answers overall and score 100% on gold standard questions relating to the Landis criteria subitems for three consecutive training manuscripts. The training platform remains available (https:// ecrf1.clinicaltrials.ed.ac.uk/iicarus/) and can still be used as a training tool for assessing manuscripts against the ARRIVE guidelines.

Power calculations
When the study was being designed, the PLOS ONE editorial team estimated that complete compliance with the ARRIVE guidelines was close to zero. To have 80% power with an alpha of 0.05, to detect an increase in full compliance from 1 to 10% (the primary outcome) would require 100 published manuscripts per group. To examine each of the individual 38 ARRIVE subitems (Secondary outcome), after correction for multiplicity of testing (alpha = 0.0013), we would require 200 published manuscripts per group to detect with 80% power an increase from 30 to 50% in the prevalence of reporting of an individual subitem. It was estimated that at time of the trial PLOS ONE accepted around 70% of manuscripts, and to account for some drop out because of the use of the same academic editor, we increased our group estimate to 150 manuscripts per group for the primary outcome, and 300 manuscripts in each group for secondary outcomes. During the course of the study, it appeared that acceptance rates were lower than the estimate, and so we increased target recruitment to 1000 manuscripts, of which we estimated 600 would be accepted for publication. Despite the risk of overpowering, we did not curtail the study when we had reached the required number of manuscripts accepted for publication because we were concerned that manuscripts with short submission to acceptance times would be enriched in the study population and might not be representative of all manuscripts.

Data validation
We validated the dataset, blinded to group allocation, to minimise errors. For example, where a manuscript was assessed as not including fish experiments, "Yes" or "No" responses to IICARus questions only relevant to fish species (e.g. Additional file 1, Questions 9.2.3-9.2.4) should not have been recorded. The R code for validation, with explanations of each response validated, and the changes we made to the data are available on the OSF. These were uploaded prior to the unblinding of the final results, at which point database lock occurred and the data were not subsequently altered in any way.

Statistical analysis
All analyses were carried out using RStudio v1.0.143 with the level of statistical significance set at p < 0.05, corrected as appropriate for multiple comparisons. Our full statistical analysis plan and accompanying R code was uploaded to the OSF prior to database lock.
We performed logistic regression with group allocation and corresponding author country of origin included as independent variables to determine any effects on full compliance (primary outcome) and compliance with each of the 38 subitems (secondary outcome), adjusting for the minimised randomisation. For our primary, secondary, and tertiary outcome measures, we used the chi-squared test of independence to test whether compliance was independent of group membership (intervention/control).
To determine if the differences between proportions is meaningful for each outcome measure, effect sizes were calculated using Cohen's H. For feasibility outcomes, medians and interquartile ranges were calculated and the Mann-Whitney U test was used to test whether a significant difference existed between groups. To control the familywise error rate of multiple comparisons, the Holm-Bonferroni method [1] was used to adjust p values for secondary, tertiary, and feasibility outcomes. Only means and confidence intervals were calculated for our exploratory analyses. A full description of our data analysis plan is available on the OSF: https://osf.io/zgqxk/.

Statistical considerations
The proportion of compliant manuscripts was assessed based on the number compliant manuscripts divided by the number of applicable manuscripts. In some cases, particularly when stratifying by manuscript country of origin or animal species, the number of manuscripts in each group is very low. If the number of applicable manuscripts for any subitem (with or without stratification) was less than 10, we did not perform statistical analysis.
The chi-squared test of Independence relies on the assumption that no more than 20% of expected counts are less than 5 and that no individual expected counts are less than 1. In cases where counts were less than 5, a Fisher's exact test was used.
Unpaired t tests rely on a normal distribution; therefore, if the distribution was non-normal, the Mann-Whitney U test, a non-parametric alternative, was used and summary medians and interquartile ranges were presented. In the case of parametric data with unequal variance between groups, Welch's t test was used due to higher reliability.

Results
We randomised 1689 PLOS ONE manuscripts: 845 manuscripts to the intervention and 844 to control. Of these, 192 were rejected by the journal and 75 manuscripts were withdrawn. A further 153 were excluded due to errors, including 77 that were found not to contain in vivo animal research. The reasons for these errors are shown in Additional file 2. Of the remaining 1269 manuscripts, 672 were accepted for publication (340 control, 332 intervention) and underwent web-based outcome assessment (Fig. 1). Manuscript allocation to group and the corresponding number of manuscripts from each country which were randomised and accepted are shown in Table 1. No authors questioned the differences in manuscript processing occurring within the intervention group. A complete dataset detailing the proportion compliance for each of the 108 questions is available online (https://doi.org/10.17605/OSF.IO/XSJBV).

Quality of outcome assessment
Three hundred sixty individuals registered with the online platform; 47 completed reviewer training, and 42 contributed at least one outcome assessment. The percentage agreement between the first and second reviewer for each manuscript was high. For 71.6% (481/672) of manuscripts, reviewers were in agreement on at least 80% of the questions. The agreement of reviewers varied considerably at the level of each of the 108 individual questions (Additional file 3: Table S1), from a kappa coefficient of 0.90 (0.86-0.93) for Question 1.1 (Is the species of animal model studied reported in the title?) to a worse than chance kappa coefficient of − 0.03 (− 0.10 --0.04) for Question 13.2 (Is the unit of analysis for at least one test explicitly specified?). This distribution of kappa agreement is displayed in a histogram (Additional file 3: Fig. S1).

Primary outcome
No manuscript achieved full compliance with the ARRIVE checklist; therefore, there was no difference between the control and intervention groups (X 2 = 0.1, df = 1, p = 0.76). Compliance with individual ARRIVE subitems ranged from 8 to 65%. The median compliance was 36.8% and 39.5% of relevant subitems in the control and intervention groups, respectively.

Logistic regression
Country of corresponding author had no influence on compliance either overall or for any individual subitems. Only one subitem had improved reporting in the intervention group versus control, subitem 9b (Provide details of husbandry conditions e.g. breeding programme, light/dark cycle, temperature, quality of water etc for fish, type of food, access to food and water, environmental enrichment) (increased log odds of compliance by 1.03 (p < 0.0001).

Secondary outcomes Compliance with individual ARRIVE subitems
Only one ARRIVE subitem had significantly improved compliance in the intervention group (Table 2). ARRIVE Reporting of animal characteristics and health status (subitem 14) was very low, with 0.29% (1/339) and 0% (0/332) compliance in the control and intervention groups, respectively. Similarly, reporting of animal housing (subitem 9a); adverse events (subitem 17b); the order of treatment and assessment (Item 11b); implications for replacement, refinement, or reduction (subitem 18c); defining primary and secondary outcomes (subitem 12); and rationale for experimental procedures (subitem 7d) was low, with less than 5% of manuscripts reporting each of these items in both groups. Figure 2 shows the percentage compliance in each group for each ARRIVE subitem in each section of the manuscript.

Reporting of Landis 4 subitems
Reporting of the Landis 4 criteria (blinding, randomisation, animal exclusions, and use of a sample size calculation) was low and did not differ significantly between groups (Fisher's estimate for difference = 0.61, df = 1, p = 0.73). 1.5% (5/340) of the control group manuscripts) and 0.9% (3/332) of intervention group manuscripts reported all four subitems of the Landis criteria.

Manuscript acceptance
There was no significant difference in the proportion of accepted manuscripts between the control and intervention

Tertiary outcomes Compliance by animal species
We removed animal species from the analysis where fewer than ten manuscripts reported the use of a species for control and intervention groups, leaving only rat and mouse studies. In studies involving mice, only reporting of ARRIVE subitem, 9b (Provide details of husbandry conditions e.g. breeding programme, light/dark cycle, temperature, quality of water etc for fish, type of food, access to food and water, environmental enrichment) increased significantly from 49.5% (105/211) in the control group to 70.2% (135/192) in the intervention group (X 2 = 16.8, df = 1, p = 0.003). No subitem had a statistically significant difference between groups in rat studies.
Results are summarised in Tables 3 and 4. There was no difference in Landis 4 compliance between animal species.

Feasibility measures
Re-assignment of academic editors occurred in a small number of cases (7/672), which confounds the recorded time in each stage and prevented us from analysing the feasibility outcomes for these manuscripts. The time from receipt of last review to final AE decision was missing from a large proportion of the remaining manuscripts (342/665), and so this analysis was not performed. Ten additional manuscripts were also excluded from the feasibility dataset due to missing data on one or more feasibility outcome s. After these exclusions, the feasibility analysis was performed on 328/340 manuscripts in the control group and 327/332 in the intervention group. For analysis of resubmitted articles, seven manuscripts were removed as these were accepted at first decision leaving 323/340 in the control group and 325/332 in the intervention group. Upon examining histograms for each variable and calculating skew using the skew() function, all data for these outcomes were found to be right skewed, so we used Mann-Whitney U test to compare timings between groups. Time spent in the PLOS editorial office was significantly higher (p < 0.0001) for manuscripts in the intervention group with a median of 9 days (interquartile range [IQR] = 6-16.5) compared to the control group with a median of 6 days [3][4][5][6][7][8][9][10]. Time from submission to   academic editor assignment was also significantly higher in the intervention group (13 days, range 9-22) than in the control group (9 days, range 7-14) (p < 0.0001). No statistically significant differences were identified for other feasibility outcomes (Table 5).

Compliance by country
There were no statistically significant differences in compliance between control and intervention groups across any corresponding author country of origin. Although we did not set out to compare differences in compliance with different ARRIVE subitems across countries, we present these data in Fig. 3.

Human studies compliance
In manuscripts without human subjects, reporting of one ARRIVE subitem, 9b (Provide details of husbandry conditions e.g. breeding programme, light/dark cycle, temperature, quality of water etc for fish, type of food, access to food and water, environmental enrichment) increased significantly from 52.4% (172/316) in the control group to 76.9% (227/295) in the intervention group (X 2 = 33.2, df = 1, p < 0.0001). In manuscripts containing human subjects, compliance also rose from 20.8% (5/24) to 51.35% (19/37) in the intervention group for this subitem, although we were limited by small sample sizes and this change was not found to be statistically significant (X 2 = 4.47, df = 1, p = 1).

Exploratory outcomes Compliance in true intervention group
Despite allocation to the intervention group, a small subset (n = 31/332) of authors did not comply with the request to submit a completed checklist, and therefore, 31 manuscripts were in the intervention group without a completed ARRIVE checklist. We sought to determine compliance with each of the 38 subitems in the "true" intervention group (those submitted with a completed checklist), compared to the control group. The pattern of compliance is similar to that of the full intervention group compared to controls, suggesting that these instances of non-compliance did not impact on results. Summary statistics are presented in Table 6.

Landis item individual compliance
To determine if there had been any changes in individual Landis subitems, we investigated randomisation, blinding, reporting of a sample size calculation, and reporting of exclusions separately (Fig. 4). Although we did not analyse these comparisons using inferential statistics, there appears to be some improvements in reporting of ran-

Discussion
Requesting completion of an ARRIVE checklist at submission did not increase full adherence with the ARRIVE guidelines. Compliance with the operationalised ARRIVE checklist was poor overall, with no manuscripts in either group even approaching full compliance; the median compliance was less than 40%, equivalent to around 15 of 38 subitems; and the intervention only increased compliance with one subitem, reporting of animal husbandry conditions. There is considerable room for improvement, and this study shows that an editorial policy of making ARRIVE checklist completion "mandatory" without compliance checks has little or no impact. There were some noticeable differences between the control and intervention group for some subitems e.g. the proportion of manuscripts compliant with 10b (Explain how the number of animals was arrived at. Provide details of any sample size calculation used) was double the size in the intervention group compared to control (3.53% vs 7.53%; Cohen's H effect size = 0.18). However, our power calculation determined that a meaningful effect would be substantially larger to justify the increased burden for authors associated with implementation by PLOS ONE.
It may be that simply requesting that authors complete checklist, without any additional editorial checks to determine whether the checklist is truly indicative of compliance, may not be enough to improve adherence to the ARRIVE guidelines. Adherence to the reporting guidelines within in the clinical literature such as CONSORT and STROBE have been widely assessed and may inform interventions to improve compliance with preclinical guidelines. Journal endorsement of these guidelines appear to have improved reporting quality [15,18]; however, it is often unclear what actions journals take to promote adherence [17] and the extent of editorial involvement is likely to have an impact. Prior reports indicate that assessing compliance with reporting guidelines at the stage of peer review leads to a significant improvement of reporting quality [5]. Other approaches (e.g. actions on the part of funders or institutions) may also be beneficial, but a successful strategy is likely to be multi-dimensional. Further, the findings reported here and the limited agreement between outcome assessors both in this study and in the recent investigation of study quality following the introduction of a new editorial policy at Nature journals [12] suggests that an important part of guideline development should be refinement of the content, the number of items (with fewer generally being better), and the agreement between assessors. It may be that a more formal adoption of research improvement strategies, with an original focus on a smaller number of items judged by a stakeholder to be of greatest importance, will allow an incremental approach to enabling and measuring improvement.
Our findings are in line with prior reports that endorsement by editors and reviewers has not significantly improved reporting of ARRIVE quality items [3,6]. We need therefore a better understanding of the barriers to implementing quality checklists for animal experiments. It has been suggested that requesting checklist adherence at the submission stage may be too late, given the observed correlation between reporting at the planning application stage and at the publication stage [19]. The PREPARE (Planning Research and Experimental Procedures on Animals: Recommendations for Excellence) guidelines [16] were published recently and may be a useful tool, in combination with the ARRIVE checklist, to promote a greater focus on experimental rigour at all stages of the research cycle. Our results contrast with recent reports of improvement in quality following mandated checklist completion following a change in editorial policy at Nature journals [7,12]. However, in both reports, study quality was retrospectively assessed in publications published prior to and after the introduction of the Nature quality checklist, which was established in 2015 as part of an organisation wide approach with substantial editorial involvement. In contrast, the current trial investigated an intervention targeted at selected manuscripts, without further editorial involvement.
Perhaps unsurprisingly, due to the additional time required for ARRIVE checklist requests, both the number of days manuscripts spent in the PLOS editorial office and the number of days from manuscript submission to AE assignment were found to be significantly longer in the intervention group. The editorial resource required to ensure that all accepted publications meet the requirements of the ARRIVE checklist is likely to be considerable, given that PLOS ONE is a high-volume publisher, with around 44,000 submissions per year. The most feasible and effective way to encourage compliance to the ARRIVE guidelines, or indeed any reporting guideline, remains to be determined, but an ongoing review of interventions to improve adherence to reporting guidelines may shed some light on this issue and direct future investigations [4]. Another consideration is the perceived clarity of the checklist to authors and reviewers. Although reviewer agreement was generally high, a few questions were less well understood by our outcome assessors which suggests the current guidelines may require clearer dissemination among the research community.

Limitations
Due to modest sample sizes, we were unable to investigate whether the intervention was more successful in countries with high awareness and adoption of the ARRIVE guidelines such as the UK, where the ARRIVE guidelines were developed and where many institutions have endorsed them. Furthermore, we did not perform a power calculation for outcomes beyond our primary and main secondary outcomes and it is possible that this, coupled with stringent adjustments for multiplicity of testing in some instances, may have prevented us from detecting any significant differences. Furthermore, our intervention only involved requests for authors to complete an ARRIVE checklist. PLOS ONE did not fully mandate checklist completion, as manuscripts without a checklist were still allowed to proceed through the trial. Furthermore, PLOS ONE did not evaluate the accuracy of the completed checklists against each manuscript. It is possible that further emphasis on evaluation and checklist adherence may result in an enhancement of study quality.
Our interpretation of compliance was also influenced by our operationalisation of the ARRIVE checklist used for outcome assessment. It was often difficult to determine how many of the details provided in the ARRIVE guidelines were sufficient for full compliance to that ARRIVE subitem.
There were unforeseen difficulties in attaining data for some outcomes, which meant that we could not assess all outcomes presented in our study protocol. This was most apparent for feasibility outcomes, where there were substantial deviations from our protocol. Furthermore, the project was subject to research waste due to overpowering our primary and secondary outcome measures. As manuscripts submitted to PLOS ONE as part of the study were treated differently, the existence of the study could have leaked to external sources; however, to the best of our knowledge, this was not the case.
Since the ARRIVE guidelines were developed by the NC3Rs, no individual employed by the NC3Rs was permitted to conduct any outcome assessment on the IICARus platform.

Conclusions
Research must be described in sufficient detail to allow research users critically to appraise experimental design, to allow them to assess the validity of the findings presented. Replication studies require, for their design, full details of what was done. Transparency in the reporting of research is paramount. Manuscripts must therefore be described in enough detail for readers to understand