Do simple screening statistical tools help to detect reporting bias?
Romain Pirracchio^{1, 2},
Matthieu Resche-Rigon^{2},
Sylvie Chevret^{2} and
Didier Journois^{1}
https://doi.org/10.1186/2110-5820-3-29
© Pirracchio et al.; licensee Springer. 2013
Received: 7 May 2013
Accepted: 16 August 2013
Published: 2 September 2013
Abstract
Background
As a result of reporting bias or fraud, false or misunderstood findings may represent the majority of published research claims. This article proposes simple methods that might help to appraise the quality of reporting in randomized controlled trials (RCTs).
Methods
The evaluation roadmap proposed herein relies on four steps: evaluation of the distribution of the reported variables; evaluation of the distribution of the reported p values; data simulation using parametric bootstrap; and explicit computation of the p values. The approach is illustrated using published data from a retracted RCT comparing a hydroxyethyl starch versus albumin-based priming for cardiopulmonary bypass.
Results
Despite obviously non-normal distributions, several variables are presented as if they were normally distributed. The set of 16 p values testing for differences in baseline characteristics across randomized groups did not follow a Uniform distribution on [0,1] (p = 0.045). The p values obtained by explicit computation differed from the results reported by the authors for two variables: urine output at 5 hours (calculated p < 10^{-6}, reported p ≥ 0.05) and packed red blood cells (PRBC) transfused during surgery (calculated p = 0.08, reported p < 0.05). Consistently, for urine output 5 hours after surgery, the parametric bootstrap yielded p > 0.05 in only 5 of the 10,000 simulated datasets; for PRBC transfused during surgery, the parametric bootstrap showed that the corresponding p value had less than a 50% chance of falling below 0.05 (3,920/10,000 simulations with p < 0.05).
Conclusions
Such simple evaluation methods might offer some warning signals. It should be emphasized, however, that these methods do not allow one to conclude that error or fraud is present; rather, they should be used to justify a request for access to the raw data.
Keywords
Reporting bias; Reporting; CONSORT; Fraud
Background
In a world where medicine is supposed to be based on evidence, the question of how the published evidence should be appraised and translated into clinical practice is of crucial importance [1, 2]. To facilitate this appraisal, important efforts have been made to propose reporting checklists and trial registries, and to extend conflict-of-interest disclosure. The CONSORT statement [3] has undoubtedly helped to improve the reporting of clinical trials. The requirement of a registration number helps the reader verify that the methodology and the endpoints were not changed after patient enrolment was completed [4, 5]. Moreover, many journals now ask professional statisticians to review manuscripts [6]. Despite such efforts, the high prevalence of reporting bias, as well as several examples of high-grade research findings published in major medical journals but refuted by subsequent evidence, maintains doubts around evidence-based medicine [7, 8].
In an ideal world, major scientific journals would ask researchers to provide their raw data to allow external verification of the results [8]. While such policies are lacking, however, it is currently difficult to verify the accuracy of published data, and it is the editors', reviewers', and readers' responsibility to appraise research reports before translating their results into clinical practice. In this context, we previously proposed some simple statistical screening tools that could help to detect certain types of reporting bias [9]. Our objective was to test whether such simple tools, applied to a manuscript known to be fraudulent [10, 11], would have helped to detect warning signals of poor quality. Such warning signals would have justified asking the trialists for more detailed information concerning the raw data.
Methods
We first provide a concise description of the methods previously proposed [9]: evaluation of the distribution of the reported variables; evaluation of the distribution of the reported p values; parametric bootstrapping and explicit computation of the p values.
Variable distribution
In many papers, data are reported as if they were normally distributed, when in fact the authors have only assumed normality. First, summary statistics such as means and standard deviations might not adequately reflect the distribution of a non-symmetrically distributed variable. Although rarely followed, the CONSORT statement recommends using the mean and standard deviation (possibly together with the interquartile range) for symmetrically distributed variables, and the median with its interquartile range for other variables [3]. Worse still, parametric statistical tests, such as the Student t test or analysis of variance (ANOVA), are often used inappropriately [12]. Such tests might be inappropriate for analyzing non-normally distributed or skewed data, especially when the sample size is not large enough. In such situations, alternatives include using nonparametric statistical tests or simply comparing confidence intervals without any statistical test [13]. However, the fact that a variable is non-normally distributed does not necessarily imply that the result of a parametric test is invalid: the t test, for instance, is quite robust to non-normal data provided the sample size is not small and the data are not too skewed. It also should be emphasized that several transformations have been proposed to convert non-Gaussian distributions to Gaussian form, that is, to eliminate or substantially reduce positive skewness and peakedness [13].
Hence, it is of interest to analyze the distribution of the reported variables. First, the reader should question whether a given variable could intrinsically behave normally. For example, duration variables (length of hospital stay or duration of mechanical ventilation, for instance) are usually not normally distributed [12]: because a duration cannot be negative but can trend toward infinity, its distribution is usually asymmetric. Second, the reported summary statistics can help the reader refute the assumption of normality, for example when a strictly positive variable has a standard deviation close to, or even larger than, its mean. This means not only that the variable's distribution is wide but also that, because negative values are impossible, it is likely to be asymmetric. Third, looking critically at the summary statistics can help the reader detect a lack of variability in the data, which can result from data smoothing, for instance. This point can easily be checked against the variability reported in the literature for similar measures.
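As a rough illustration of the second point above, one can compute how much probability mass a Normal distribution with the reported mean and SD would place below zero; for a strictly positive variable, a non-negligible value contradicts the normality assumption. A minimal sketch (the numbers are hypothetical, not taken from any particular paper):

```python
import math


def prob_negative(mean, sd):
    """Implied P(X < 0) if X followed a Normal(mean, sd) distribution.

    For a strictly positive variable (e.g., a duration), a non-negligible
    value is a warning signal that the normality assumption cannot hold.
    """
    z = (0 - mean) / sd
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


# Hypothetical example: a length of stay reported as mean 5 days, SD 6 days
print(f"Implied P(X < 0) = {prob_negative(5, 6):.2f}")  # about 0.20
```

A result of roughly 20% mass below zero, for a variable that cannot be negative, strongly suggests an asymmetric distribution.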
Baseline covariate distribution between groups
Statistical testing should be avoided when evaluating covariate balance, because the usual tests are not designed to accept the null hypothesis. Nonetheless, most published manuscripts report such tests, and analyzing their results can help to detect poor-quality data.
P values distribution
If the randomization was adequately performed, the distribution of baseline characteristics should be balanced between the two randomized groups. Under this null hypothesis (i.e., the two groups have similar baseline characteristics), the p values for the comparisons of independent baseline characteristics should follow a Uniform distribution over the interval [0,1] [14]; in other words, the p values should be uniformly spread between 0 and 1. In case of fraud, authors are usually tempted to produce p values all close to 1. Hence, when the p values are all close to 1, one should probably consider this a warning signal. As an example, we propose to distinguish a two-class partition of the p values (<0.5, ≥0.5), to check graphically the observed distribution of the reported p values in these two classes, and to see whether it is compatible with a Uniform distribution. Of course, this evaluation relies on the assumption that the authors reported all the baseline characteristics they compared. If some baseline covariates were left unreported, such a test of adequacy to the Uniform distribution could not be validly performed on the subset of reported p values, because of the missing (and potentially informative) p values.
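The two-class check described above can be sketched as follows. Under uniformity on [0,1], the number of p values below 0.5 follows a Binomial(n, 1/2) distribution, which an exact binomial test can assess; the counts used here are hypothetical, not those of the paper analyzed later.

```python
import math


def binom_two_sided(k, n, p=0.5):
    """Exact two-sided binomial test p value: the total probability of all
    outcomes no more likely than the observed count k out of n."""
    pmf = lambda i: math.comb(n, i) * p**i * (1 - p) ** (n - i)
    pk = pmf(k)
    return sum(pmf(i) for i in range(n + 1) if pmf(i) <= pk + 1e-12)


# Hypothetical example: only 3 of 16 baseline p values fall below 0.5,
# where 8 would be expected under a Uniform distribution
print(f"p = {binom_two_sided(3, 16):.3f}")
```

A small p value here indicates that the reported split is unlikely under uniformity, which is a warning signal rather than proof of misconduct.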
Explicit p value computations
Based on the reported summary statistics (means and standard deviations, for instance), one can also compute the p value of a Student t test (formula provided in Additional file 1). The p values of Fisher's exact and Chi-square tests can likewise be computed retrospectively from the data reported in a contingency table. Moreover, it should be emphasized that p values are probabilities used to reject, or not, the null hypothesis according to prespecified rejection rules. Multiplying statistical tests intrinsically inflates the type I error, i.e., the probability of rejecting the null hypothesis when it is in fact true. To address this risk, one should adapt the prespecified rejection rules, that is, adjust the p value threshold for statistical significance [15].
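A minimal sketch of such an explicit computation (the exact formula is in Additional file 1; this version uses the pooled-variance t statistic with a large-sample normal approximation for the tail probability, and hypothetical summary numbers):

```python
import math
from statistics import NormalDist


def t_from_summary(m1, s1, n1, m2, s2, n2):
    """Two-sample pooled-variance t statistic from summary statistics only."""
    sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
    return (m1 - m2) / math.sqrt(sp2 * (1 / n1 + 1 / n2))


def approx_two_sided_p(t):
    """Normal approximation to the two-sided p value; adequate for
    moderately large degrees of freedom (use the exact t distribution,
    e.g. scipy.stats.t, for small samples)."""
    return 2 * (1 - NormalDist().cdf(abs(t)))


# Hypothetical summary data: means 100 vs 90, SD 15, 25 patients per group
t = t_from_summary(100, 15, 25, 90, 15, 25)
p = approx_two_sided_p(t)
print(f"t = {t:.2f}, approximate p = {p:.3f}")
```

Comparing such a recomputed p value with the one reported in a manuscript is exactly the check applied in the Results section.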
Parametric bootstrap
The parametric bootstrap [16] is a simulation procedure that consists of randomly generating many independent datasets on the basis of the reported characteristics of the population and some knowledge of the distribution law. Simulation allows one to recreate a large number of virtual datasets retrospectively. Simulating a dataset requires: 1) a location parameter (mean or median, for example); 2) a dispersion parameter (standard deviation or 95% confidence interval, for example); and 3) some knowledge of the distribution law. The parametric bootstrap then consists of randomly generating N independent datasets of size n on the basis of the reported sample size, mean, and variance (Additional file 1). Consider a variable such as age, reported with a mean of 60 years and a SD of 40 in a sample of size 1,000. If one assumes that age is normally distributed in the population, one can use the reported mean and SD to draw N independent samples from a Normal distribution with mean 60 and SD 40. The simulated datasets may first be used to examine the distribution of the data, with two specific questions: 1) Is this distribution possible? (e.g., it is obviously impossible to observe negative values for a variable such as serum creatinine); 2) Is the distribution compatible with the variable itself? (e.g., a variable whose measurement method has a large intrinsic variability should have a variance in agreement). Second, it is possible to rerun the statistical analyses in each simulated dataset, using the same tests as the authors, and to build the distribution of the resulting p values.
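The procedure just described can be sketched as follows. The summary statistics are hypothetical, and the test is a Welch-type comparison with a normal approximation, whereas the authors of a given paper may have used an exact t test:

```python
import random
import statistics
from statistics import NormalDist


def welch_p(x, y):
    """Two-sided Welch-type test using a large-sample normal approximation."""
    se = (statistics.variance(x) / len(x) + statistics.variance(y) / len(y)) ** 0.5
    z = abs(statistics.fmean(x) - statistics.fmean(y)) / se
    return 2 * (1 - NormalDist().cdf(z))


def bootstrap_sig_count(m1, s1, m2, s2, n=25, N=10_000, seed=1):
    """Count simulated datasets (out of N) in which p < 0.05, drawing both
    groups from Normal distributions with the reported means and SDs."""
    rng = random.Random(seed)
    return sum(
        welch_p([rng.gauss(m1, s1) for _ in range(n)],
                [rng.gauss(m2, s2) for _ in range(n)]) < 0.05
        for _ in range(N)
    )


# Hypothetical reported summaries: means 100 vs 90, SD 15, n = 25 per group
count = bootstrap_sig_count(100, 15, 90, 15)
print(f"{count}/10,000 simulated datasets with p < 0.05")
```

If the authors report a significant difference but only a minority of simulated datasets reach p < 0.05 (or conversely), the reported result deserves scrutiny.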
It should be emphasized that this approach does not aim to provide new inference but only to detect potential inconsistencies. Moreover, simulating variables from reported means and standard deviations relies on an assumption about the underlying distribution. Because some variables are obviously not normally distributed, it can be of interest to compare simulations under a Normal distribution with simulations under alternative distributions. Finally, simulating 10,000 datasets is not the same as “redoing 10,000 times the same clinical trial, with the same sample size”: in such simulations, the observed means and standard deviations are treated as the true population parameters, whereas they are in fact the observed parameters of a random sample drawn from the underlying population.
Illustrative example
To illustrate these methods, we selected the data from a randomized study published in 2009 [10] that was based on two groups of 25 patients (Additional file 2). We chose this paper because it is known to be fraudulent and was retracted by the Editor-in-Chief of the journal in 2010 [11]. To perform our retrospective simulations, we selected from this paper the continuous variables whose means and standard deviations were tabulated. Assuming that the reporting was appropriate, we considered that variables expressed as mean (SD) should be normally distributed. We then used these reported means and standard deviations, under the normal distribution assumption, to build N = 10,000 datasets of size n = 50 (25 subjects in each group). Based on these simulated datasets, we reran the statistical analysis using Student t tests and plotted the distribution of the observed p values across the 10,000 datasets. In addition, to address the skewed distribution of some variables, we also used the reported means and standard deviations to simulate log-normal observations and reran the analyses using nonparametric rank-sum tests.
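To simulate log-normal observations matching a reported mean m and SD s, those summaries must first be converted into the parameters of log(X). A minimal sketch of this conversion (the figures are illustrative, not taken from the paper):

```python
import math
import random


def lognormal_params(m, s):
    """Parameters (mu, sigma) of log(X) such that X is log-normally
    distributed with mean m and standard deviation s."""
    sigma2 = math.log(1 + (s / m) ** 2)
    return math.log(m) - sigma2 / 2, math.sqrt(sigma2)


# Illustrative summary statistics: mean 60, SD 40
mu, sigma = lognormal_params(60, 40)

# Check by simulation: the sample mean should be close to 60
rng = random.Random(0)
sample = [rng.lognormvariate(mu, sigma) for _ in range(100_000)]
mean_hat = sum(sample) / len(sample)
print(f"simulated mean = {mean_hat:.1f}")  # close to 60
```

The same reported summaries thus drive both the Normal and the log-normal simulations, so the two sets of results in the table below are directly comparable.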
We identified 19 tabulated continuous variables that were supposed to be normally distributed and for which a mean and a standard deviation were reported.
Results
Variable distributions
P values distribution
Under the null hypothesis of no between-group difference after adequate randomization, the set of 16 p values presented in Additional file 2 (Table S1a) should follow a Uniform distribution on [0,1]. This means that, among 16 such values, on average 8 are expected to be <0.5. In the present case, however, more p values than expected were >0.5 (Figure 1B).
Explicit computations
Explicit computation of the p values from the reported summary statistics disagreed with the reported results for two variables:
Urine output at 5 hours: p < 10^{-6} (comparison reported as nonsignificant by the authors)
PRBC during surgery: p = 0.08 (comparison reported as statistically significant in the paper)
Parametric bootstrap
Distribution of the p values corresponding to the comparison of the two groups across the 10,000 simulated datasets (number of simulated datasets with p < 0.05):

Variable                        | Normal distribution | Log-normal distribution
Colloids 5 hr after surgery     | 9,113/10,000        | 8,969/10,000
Urine output 5 hr after surgery | 9,995/10,000        | 9,985/10,000
PRBC volume                     |                     |
  During surgery*               | 3,920/10,000        | 5,371/10,000
  5 hr after surgery*           | 5,114/10,000        | 6,601/10,000
  Until first POD*              | 5,533/10,000        | 7,133/10,000
  Until second POD*             | 4,633/10,000        | 5,742/10,000
FFP                             |                     |
  During surgery                | 879/10,000          | 2,133/10,000
  5 hr after surgery*           | 4,377/10,000        | 8,412/10,000
  Until first POD*              | 5,880/10,000        | 9,446/10,000
  Until second POD*             | 5,874/10,000        | 9,424/10,000
Discussion
We proposed a critical appraisal of the results of randomized controlled trials based on a multi-step procedure (Figure 1): evaluation of the distribution of the reported variables; evaluation of the distribution of the p values reported for the comparison of the baseline characteristics of the two groups; explicit computation of the p values; and parametric bootstrap. The roadmap does not aim at diagnosing fraud or bias but rather at detecting potential warning signals that reviewers or editors could use to justify queries about the raw data. When present, such warnings do not necessarily imply fraud but might only reflect some misuse of statistical methods. Whether such misuse amounts to fraud can be adjudicated by analyzing the raw data with a variety of methods. Thereafter, reluctance to share data might be considered a further clue of a poor-quality study. Indeed, Wicherts et al. [17] recently reported that authors' reluctance to share data is associated with more errors in the reporting of statistical results and with weaker evidence. Provided the raw data are available, some authors have proposed checking the data distribution more precisely. For instance, looking at the kurtosis of the distribution, a measure of its shape, might be of interest: even an astute cheater who preserves the mean and the variance may be tripped up by examination of the kurtosis [6]. Another approach is to look at the reported numbers, and especially at their digits. Benford's law [18] states that, under certain conditions, the distribution of the first digit of a variable follows a particular logarithmic distribution [15]; this law might be used to check the randomness of the numbers reported in a paper [6]. Finally, Masicampo et al. [19] focused on the prevalence of p values just below 0.05.
They examined the distribution of p values reported in a large subset of papers from three highly regarded psychology journals and found that p values were far more common immediately below 0.05 than would be expected from the number of p values in other ranges. The authors discussed potential sources of this pattern, including publication bias and researcher degrees of freedom. Such approaches could be combined with those we suggest.
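The Benford's-law check mentioned above can be sketched as follows; the numbers used here are purely illustrative, and real use would apply the check to the full set of values reported in a manuscript, followed by a formal goodness-of-fit test.

```python
import math
from collections import Counter


def benford_expected():
    """Expected first-digit frequencies under Benford's law:
    P(d) = log10(1 + 1/d) for d = 1..9."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def first_digit(x):
    """First significant digit of a positive number."""
    while x < 1:
        x *= 10
    while x >= 10:
        x /= 10
    return int(x)


def first_digit_freq(numbers):
    """Observed first-digit frequencies in a collection of numbers."""
    counts = Counter(first_digit(x) for x in numbers)
    return {d: counts.get(d, 0) / len(numbers) for d in range(1, 10)}


# Illustrative (hypothetical) set of reported values
observed = first_digit_freq([1234, 58, 0.042, 911, 3.6, 270, 145, 19])
expected = benford_expected()
for d in range(1, 10):
    print(f"digit {d}: observed {observed[d]:.2f}, expected {expected[d]:.2f}")
```

Under Benford's law the digit 1 should lead about 30% of the time; a markedly different observed pattern across many reported numbers is another potential warning signal.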
When absent, such warning signals do not exclude the presence of bias. Moreover, even a careful analysis might fail to address other sources of bias, such as selective reporting (e.g., reporting only the observations that reached significance) and publication bias (e.g., selective, faster, and more prominent publication of research findings that produced unusual, very strong, or highly significant results) [17].
To illustrate our four-step procedure, we applied it to the data reported by Boldt et al. in an article that was recently retracted for fraud [10, 11]. It should be pointed out that the Editor-in-Chief decided to retract the article after several letters were sent to the journal pointing out the surprisingly small variability of the cytokine measurements reported in the paper. Using this article as an example helped us to illustrate that, even though each step of our approach might be well known to statistically trained reviewers or readers, strictly applying this four-step procedure could have provided some important warning signals when appraising Boldt's manuscript.
This series of analyses has some limitations. First, these analyses alone may not discriminate between low- and high-quality data, because there is a variety of sources of bias that they may not explore. However, in the context of medical diagnosis, we usually oppose screening and diagnostic tools: a diagnostic tool is asked to be very specific, while a screening tool is essentially asked to be sensitive. The objective is to offer a strategy based on a first, highly sensitive “screening” line of exams which, in case of positive results, is confirmed by a second, highly specific line of diagnostic exams in order to detect potential false positives. We think that our series of analyses should be considered a second line of exams: it is definitely not sensitive enough to detect fraud or poor-quality data in the whole “population” of manuscripts. The first “screening” line should be the reviewer's (or, when the manuscript is already published, the reader's) own critical sense, and even better, the statistical reviewer. When dealing with potentially fraudulent data, it seems crucial to offer multiple lines of screening: as stated by Haldane (“second-order faking”), when data are fabricated to pass certain statistical tests, they are likely to fail on others [20]. Second, the series of analyses was illustrated on a single set of data [10]; a complete validation would require applying it, in a blinded manner, to a series of manuscripts. Third, the step based on the parametric bootstrap may somehow be compared with retrospective power analysis, which is highly controversial [21].
Conclusions
There is increasing concern that, in modern research, false findings may represent the majority of published research claims [4]. This can be viewed as the result of various types of fraud or, more commonly, of inappropriate use of standard statistical tools. Simple evaluations of the reported data, such as those described in the present note, might offer some warning signals. Because the described methods are quite general, the presence of such warning signals should prompt a request for the raw data, in order to appraise critically the quality and the validity of the data using additional, more specific tools shaped to explore both the distribution and the reporting of a given variable [6]. For the future, we should concentrate our efforts on organizing systematic publication of raw data, in order to allow independent validation of results [22–24]. Until systematic access to raw data becomes the rule, the proposed tools could serve as a reading guide for both readers and reviewers.
Key messages
- Poor-quality evidence and fraud are growing concerns in medical research
- Reporting guidelines should be strictly followed and enforced by medical journals
- Guidelines should be provided to reviewers to ensure homogeneous evaluation of key reporting points
- Similar simple screening tools should be made available to readers
References
1. Cabana MD, Rand CS, Powe NR, Wu AW, Wilson MH, Abboud PA, Rubin HR: Why don't physicians follow clinical practice guidelines? A framework for improvement. JAMA 1999, 282: 1458–1465. doi:10.1001/jama.282.15.1458
2. Grol R, Grimshaw J: From best evidence to best practice: effective implementation of change in patients' care. Lancet 2003, 362: 1225–1230. doi:10.1016/S0140-6736(03)14546-1
3. Begg C, Cho M, Eastwood S, Horton R, Moher D, Olkin I, Pitkin R, Rennie D, Schulz KF, Simel D, Stroup DF: Improving the quality of reporting of randomized controlled trials: the CONSORT statement. JAMA 1996, 276: 637–639. doi:10.1001/jama.1996.03540080059030
4. Huić M, Marušić M, Marušić A: Completeness and changes in registered data and reporting bias of randomized controlled trials in ICMJE journals after trial registration policy. PLoS One 2011, 6: e25258. doi:10.1371/journal.pone.0025258
5. Mathieu S, Boutron I, Moher D, Altman DG, Ravaud P: Comparison of registered and published primary outcomes in randomized controlled trials. JAMA 2009, 302: 977–984. doi:10.1001/jama.2009.1242
6. Buyse M, George SL, Evans S, Geller NL, Ranstam J, Scherrer B, Lesaffre E, Murray G, Edler L, Hutton J, Colton T, Lachenbruch P, Verma BL: The role of biostatistics in the prevention, detection and treatment of fraud in clinical trials. Stat Med 1999, 18: 3435–3451. doi:10.1002/(SICI)1097-0258(19991230)18:24<3435::AID-SIM365>3.0.CO;2-O
7. Ioannidis JPA: Why most published research findings are false. PLoS Med 2005, 2: e124. doi:10.1371/journal.pmed.0020124
8. Weir C, Murray G: Fraud in clinical trials. Significance 2011, 8: 164–168. doi:10.1111/j.1740-9713.2011.00521.x
9. Pirracchio R, Resche-Rigon M, Chevret S, Journois D: Simple screening statistical tools to detect reporting bias: when should we ask for raw data? Crit Care 2013, 17: 142. doi:10.1186/cc12611
10. Boldt J, Suttner S, Brosch C, Lehmann A, Röhm K, Mengistu A: Cardiopulmonary bypass priming using a high dose of a balanced hydroxyethyl starch versus an albumin-based priming strategy. Anesth Analg 2009, 109: 1752–1762. RETRACTED. doi:10.1213/ANE.0b013e3181b5a24b
11. Shafer SL: Notice of retraction. Anesth Analg 2010, 111: 1567. doi:10.1213/ANE.0b013e3182040b99
12. Qualls M, Pallin DJ, Schuur JD: Parametric versus nonparametric statistical tests: the length of stay example. Acad Emerg Med 2010, 17: 1113–1121. doi:10.1111/j.1553-2712.2010.00874.x
13. Harris EK, DeMets DL: Estimation of normal ranges and cumulative proportions by transforming observed distributions to gaussian form. Clin Chem 1972, 18: 605–612.
14. Sellke T, Bayarri MJ, Berger JO: Calibration of p values for testing precise null hypotheses. Am Stat 2001, 55: 62–71. doi:10.1198/000313001300339950
15. Pocock SJ: Clinical trials with multiple outcomes: a statistical perspective on their design, analysis, and interpretation. Control Clin Trials 1997, 18: 530–545. doi:10.1016/S0197-2456(97)00008-1
16. Efron B, Tibshirani RJ: An Introduction to the Bootstrap (Chapman & Hall/CRC Monographs on Statistics & Applied Probability). Boca Raton, FL: Chapman & Hall/CRC; 1994.
17. Wicherts JM, Bakker M, Molenaar D: Willingness to share research data is related to the strength of the evidence and the quality of reporting of statistical results. PLoS One 2011, 6: e26828. doi:10.1371/journal.pone.0026828
18. Benford F: The law of anomalous numbers. Proc Am Philos Soc 1938, 78: 551–572.
19. Masicampo EJ, Lalande DR: A peculiar prevalence of p values just below .05. Q J Exp Psychol 2012, 65: 2271–2279. doi:10.1080/17470218.2012.711335
20. Haldane J: The faking of genetic results. Eureka 1948, 6: 21–28.
21. Cohen J: Statistical Power Analysis for the Behavioral Sciences. Hillsdale, NJ: Lawrence Erlbaum Associates; 1988.
22. Kirwan JR: Making original data from clinical studies available for alternative analysis. J Rheumatol 1997, 24: 822–825.
23. Hutchon DJ: Publishing raw data and real time statistical analysis on e-journals. BMJ 2001, 322: 530. doi:10.1136/bmj.322.7285.530
24. Smith R, Roberts I: Patient safety requires a new way to publish clinical trials. PLoS Clin Trials 2006, 1: e6. doi:10.1371/journal.pctr.0010006
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.