"Why Science is Never Settled" – Part Two

by Tedd Roberts

In Part One of this essay, we explored the basics of the Scientific Method and demonstrated how various scientific fields are constantly changing their notion of settled science as theories and hypotheses are refined and technology for testing those hypotheses improves. In Part Two, we now move on to examine the ways in which errors, miscalculations and deliberate deceptions contribute to upsetting settled science.

* * *

Scientists are human, too

Yes, scientists are human, with all of the faults and foibles that implies. The example of the "Bone Wars" between paleontologists Cope and Marsh should tell us that. While popular culture prefers to paint Galileo as persecuted by the Church for his science -- indeed, consequently founding a counter-religious illuminati of scientists -- careful study of history reveals that Galileo was not "persecuted" for his beliefs, but rather he was sanctioned by Rome for his personal actions in defiance of a church order of which he was a member. We certainly have plenty of parallels today in which it is easy to point to scientists whose behavior casts a shadow on their own work. Of course, there are a few factors which tend to assist the process of self-destruction.

The problem of "Publish-or-Perish"

The essential currency of an academic scientist consists of two items: how many papers they publish, and how well-funded their research is. While many scientists would love to have a job where all they needed to do was conduct experiments with no obligation to fight for promotion and funding, the simple truth is that any job must be evaluated by some form of a performance metric. Within most scientific jobs, that metric consists of having other scientists evaluate your work and pronounce that it is good. Typically, this consists of writing up results (and conclusions), submitting them to a scientific journal, obtaining a favorable review by peers, and then having that paper published in a journal where others can read it. Most of the evaluation of "worth" comes from the peer-review process (and more on that later), since, once published, any confirmation or refutation of the experimental results must take the form of letters to the editor, or new papers which agree or disagree with the published results. Letters to the editor are in fact very rare in science -- not that they do not exist, but that the number of letters compared to the number of published papers is really very small (not all journals accept letters, and even then, there may be 1-2 per issue, while the number of new papers is often 20-50 per issue).

Studies which produce results and conclusions counter to those already published must overcome the prior results in both numbers (how many published papers cite the result or the refutation) and the "Impact Factor" of the journal in which the study appears. Much as certain newspapers have reputations based on circulation and the type and number of articles they print, scientific journals have a similar ranking system based on a ratio of the number of citations their articles receive per year divided by the number of articles they publish. Thus, it is not just how many articles are published, but how many are read and subsequently cited by other authors (in other papers and journals). This ratio gives a sense of the relative impact that a journal has compared to others in its field. Thus an article in Science or Nature has 2-5 times the impact factor of an article in Journal of Neuroscience, and 10-20 times the impact factor of an article in an open-access, open-review journal such as Frontiers in Neural Science. Publishing countervailing research in a lesser-impact journal is much like a battle of King of the Hill, and requires either repeated publication or getting the countervailing results into a similar high-impact journal.

Scientists are thus rated on the number of papers they publish, the impact factor of the journals in which they publish, and indirectly (via impact factor) how often their research is cited. At research foundations, corporate laboratories and government agencies, publication is the main tool used to assess the productivity of a scientist. In academia, there is one other factor: funding. Within colleges, universities and medical schools, some or all of a professor's salary may be "hard money" (paid by the institution) and some "soft money" (paid from research grants). In fact, almost all salaries in research below the Professor level are paid to a large degree from research funds, and the greater the scope of a research project, the more funding sources are required to support it. Research grants go through a peer review process similar to research publications; therefore the number of funded grants is also used as an evaluation tool for salaries and promotion of academic scientists.

Thus we come to the publish or perish dictum. The requirement for published papers and grants varies from place to place and also depends on the scientist's status: full professors with tenure need not worry as much as assistant professors without tenure. I have seen one recommendation that an academic scientist should publish one paper per year per person in the lab. Thus, a lab consisting of the investigator, one post-doc, two students and a full-time technician would want to publish 5 papers (or 4 papers plus one new, funded research grant) per year, with more credit given for high-impact articles. Personally, I feel that a lab which tries to publish so much runs a grave risk of errors, although it is true that larger labs have more opportunity to publish than small labs, so perhaps some variation on the rule is appropriate. My own preference has long been for one or two moderately placed (in terms of impact factor) papers per year, plus presentations at two scientific meetings a year. Still, given that it takes several weeks of writing just to produce one paper, plus time for the reviews and revision, publish or perish soon involves more time writing than researching, and all too often, rushing results into print before they are fully analyzed.

Accidents occur, and scientists are not immune from them. Hopefully, errors are caught in the review process; it has certainly happened to me, and I've caught many errors as a reviewer. Too much pressure to publish too often (or simply rushing the process) can lead to errors that must later be corrected, either through published retraction, or simply by other lab(s) finding and reporting differing results. No scientist truly wants to get a result published, and then find out later that the results were not valid due to a decimal point error in statistics... except when the errors are deliberate...

Lies, Damned Lies, and Statistics!

Since scientists are human, there is always the chance that one of them (us!) will deliberately manipulate data and/or results -- particularly through the use of statistics. Those readers with a background in statistics will know that the primary use of statistical tests is to determine whether two sets of observations are different when it is difficult to determine by other means. Certainly, some scientific results can be clearly determined without statistics. For example: "Roses are red, violets are blue..." embodies an observation with a clear difference between the two cases. Ah, but it is not always so simple. Violets are, in fact, a shade of purple, although statistically speaking, if one were to measure the color hues of violets, they might be shown to overlap with blue. As scientists, we would phrase the statistical question as follows: Does the color variation of the population of violets include the color blue? We further qualify the question to: Is it likely that if we examined 95% of all violets in existence (also known as the "population" of violets), they would include the color blue? This, then, is what is known as a "P<0.05", or a probability limit of 5%. In other words, does the presence of the color blue fall within 95% of the population of violets, or with the "outliers" comprising the other 5%? Statistical comparison provides a means of answering that question of whether two conditions can produce the same result given the normal, random variability of natural systems.

Now of course, when we get to roses, the situation is very different. While violets, by definition, have a fairly limited color palette, roses have a very broad palette -- from white to black, and nearly all colors in between. Thus in answer to the question: "Are roses red?" we can say yes. Likewise with: "Are violets blue?" "Yes." Now we can also look at the additional question: "Are violets red?" and, with a statistical likelihood of at least 95%, answer "No, violets are not red (P<0.05)." In other words, we accept the hypothesis: Violets are not red. But, "Are roses blue?" Here we have a problem, because the 95% population of rose colors does include the color blue, and we reject the hypothesis: Roses are not blue.

Thus we come to the crux of the problem -- even without deliberate malfeasance, it is all too easy to misrepresent the results of scientific experiments when the statistical tests give ambiguous results. From a scientific standpoint, roses are red (among other colors) and violets are blue (to within a couple of shades); in addition, violets are not red (P<0.05), but roses are not "not-blue." Note that these "statistical" results are dependent on how thoroughly the scientist samples the populations for their test. If we sampled only American Beauty roses, then indeed, roses would be red and not blue, and our statistical confirmation would be valid, but only for the population of flowers we sampled.
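The roses-and-violets reasoning can be sketched in a few lines of Python. This is only an illustration of the informal "does the color fall within 95% of the population" framing used here, not a formal hypothesis test. Every number is hypothetical: the hue values (degrees on a color wheel, treated as a simple number line), the sample lists, and the names BLUE, RED, central_interval and includes are all invented for the example, and the interval assumes roughly normal variation.

```python
from statistics import NormalDist, mean, stdev

# Hypothetical hue values in degrees; these are illustrative, not measured.
BLUE = 240
RED = 350

def central_interval(samples, coverage=0.95):
    """Range expected to hold `coverage` of the population,
    assuming the samples vary roughly normally."""
    z = NormalDist().inv_cdf(0.5 + coverage / 2)  # about 1.96 for 95%
    center, spread = mean(samples), stdev(samples)
    return center - z * spread, center + z * spread

def includes(samples, hue, coverage=0.95):
    """True if `hue` falls inside the central interval of the samples."""
    low, high = central_interval(samples, coverage)
    return low <= hue <= high

# Invented samples: violets cluster near purple, roses span the wheel,
# and American Beauty roses cluster tightly around red.
violets = [240, 250, 255, 260, 265, 270, 275, 280, 285, 290]
all_roses = [0, 30, 60, 120, 240, 300, 330, 350]
american_beauty = [345, 348, 350, 352, 355]

print("violets include blue:", includes(violets, BLUE))
print("violets include red:", includes(violets, RED))
print("all roses include blue:", includes(all_roses, BLUE))
print("American Beauty roses include blue:", includes(american_beauty, BLUE))
```

The last two lines show the sampling problem: the same question about roses gets a different answer depending on which roses were sampled.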

A Lack of (Statistical) Power

A paper on this very issue of statistical misuse entitled "Power failure: why small sample size undermines the reliability of neuroscience," by Katherine S. Button et al. appeared in the April 10, 2013 issue of Nature Reviews Neuroscience, a quite respectable journal featuring reviews in the field of neuroscience. This report reviewed the statistical tests reported in Neuroscience research papers from 2011, and concluded that in the data they sampled, the statistical tests from those studies were very likely to either accept a hypothesis as true -- when it was not -- or miss confirming a true hypothesis.

The Nature Reviews article started with a literature search for meta-analyses in Neuroscience published in 2011, and found 246 articles that used "meta-analyses" -- essentially combining the data from many prior papers and re-analyzing those larger data sets for observations that cannot be seen in small data sets. A meta-analysis thus looks at data gathered and reported across many primary publications -- original reports from a single lab. The authors then sorted through those 246 articles for ones that provided enough information on the original data to allow calculation of statistical power, which was possible for about 40 of those papers. A power function uses the mean, or average, for a population and a measure of variability to determine how large a difference can be reliably detected given a limited sample size. In essence, it is a way of predicting whether a statistical test is, itself, valid. The Statistical Power Function is the foundation of experimental design, and is the basis for justifying how many subjects to test, and what is considered a statistically significant result.
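To make the idea of a power calculation concrete, here is a minimal Python sketch. It is not the method used by Button et al.; it assumes a simple two-sided comparison of two equal-sized groups using a normal (z-test) approximation, and the function names two_sample_power and subjects_needed are invented for this example. The "effect size" is the difference between the two group means divided by their common standard deviation.

```python
from math import sqrt
from statistics import NormalDist

def two_sample_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test.

    effect_size: difference in means divided by the common standard deviation.
    n_per_group: number of subjects in each of the two groups.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)    # cutoff for significance
    shift = effect_size * sqrt(n_per_group / 2)   # expected test statistic
    # Chance that the test statistic lands beyond either cutoff.
    upper_tail = 1 - std_normal.cdf(z_crit - shift)
    lower_tail = std_normal.cdf(-z_crit - shift)
    return upper_tail + lower_tail

def subjects_needed(effect_size, target_power=0.8, alpha=0.05):
    """Smallest group size whose approximate power reaches the target."""
    n = 2
    while two_sample_power(effect_size, n, alpha) < target_power:
        n += 1
    return n

for n in (5, 10, 20, 50, 100):
    print(n, "per group -> power", round(two_sample_power(0.5, n), 2))
print("subjects per group for 80% power:", subjects_needed(0.5))
```

Under these assumptions, a small study looking for a moderate effect has a low chance of detecting it even when the effect is real, which is the kind of shortfall the paper describes.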

Button et al. concluded that many of the meta-analysis papers did not have high statistical power, and risked incorrect interpretations of the statistical comparisons. However -- and this is very important -- the authors did not apply this conclusion to the field of Neuroscience as a whole. In short, headlines from April 2013 implying that the study condemns an entire field of science are false. In perspective, this article in Nature Reviews Neuroscience sounds a cautionary note regarding the need for better statistical planning in meta-analysis. What the article does not do is state that all or even many Neuroscience articles have the same flaw. In particular, given that this caution applies to making unwarranted conclusions (and affects our notion of settled science), it again points out the fact that scientific discovery is an ongoing process, and the very announcement of settled conclusions sets the research up for scrutiny and critique. It behooves us all to avoid the danger of schadenfreude by using the result outside the scope of the study, thereby committing the exact same error pointed out in the paper. Meanwhile, there are other factors at work which point out the good and bad with respect to scientific research, but frankly, misinterpretation of statistics pales in comparison to deliberate deception.

The vaccine controversy

One of the more famous incidents of scientific malfeasance involved a study from 1998 which showed a link between the measles vaccine and autism. The data showing a causal link was taken from autistic children who had received the "MMR" vaccine to prevent measles. However, when the study could not be duplicated, investigation revealed that data was taken from just 12 out of over 200 children available for the study. This procedure is often called "cherry-picking" and is used to ensure small variability within a set of data so that any statistical test comes out exactly in a manner predetermined by the experimenter. On this basis alone, the study was invalidated and a retraction printed in The Lancet, a high-impact factor journal for medical research.
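The effect of cherry-picking on variability can be shown with a toy simulation in Python. This is purely illustrative and does not reconstruct the 1998 study or its data: the scores are random numbers drawn from a single population with no real effect, and the function name cherry_pick, the seed, and the scoring scale are all assumptions made for the example. Only the group sizes (12 out of 200) echo the text above.

```python
import random
from statistics import mean, stdev

def cherry_pick(values, k):
    """Keep only the k largest values, i.e. those that best fit
    a result the experimenter wanted in advance."""
    return sorted(values, reverse=True)[:k]

# Hypothetical scores for 200 children, all drawn from one population
# with an average of zero, so there is no true effect to find.
rng = random.Random(0)
all_children = [rng.gauss(0.0, 1.0) for _ in range(200)]

picked = cherry_pick(all_children, 12)

print("all 200: mean", round(mean(all_children), 2),
      "spread", round(stdev(all_children), 2))
print("picked 12: mean", round(mean(picked), 2),
      "spread", round(stdev(picked), 2))
```

The hand-picked group shows a higher average and a much tighter spread than the full group, which is exactly what makes a statistical test on such a subset look convincing.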

This could have been ruled a mistake, given that experimenters may "cherry-pick" data from human patients given the ethical concerns of withholding a beneficial treatment, thus impairing the ability to establish strict controls on the study. Under the circumstances, it may have been necessary to severely limit the study population if there were various other factors which could have contributed to the effects and confused the findings. [Incidentally, this accusation is often raised against other correlative health studies: smoking, diet, cholesterol, gluten, etc.] Unfortunately, this case was not about simple misuse of statistics, for there was a deeper thread of malfeasance. The lead author of the study, Dr. Andrew Wakefield, had filed a patent for an alternative to the MMR vaccine. Furthermore, a researcher working under Wakefield's supervision reported that there was no measles virus present (hence no effect of the MMR vaccine) in the children used for the initial study, while a former graduate student testified in court proceedings that Wakefield ignored data which did not fit his hypothesis (that MMR vaccine was linked to autism).

In the aftermath, Wakefield resigned his hospital job in the United Kingdom, and was later censured by the UK General Medical Council, fired from his medical and research positions, and barred from practicing medicine in the U.K. He has since moved to the U.S., and despite admitting to the improper study, is still active in promoting the link between MMR vaccine and autism. In a strange turnabout to the notion of settled science, Wakefield's supporters accuse the medical authorities of the U.S. and U.K. of a dogmatic approach and of failing to acknowledge a link between vaccines and various diseases and disorders.

If data can be "cherry-picked," statistics can be misused, and hypotheses incorrectly rejected or confirmed, what are the protections against scientific malfeasance? What guarantee is there that a scientific report is valid, even if it goes against the conventional wisdom of the field? At the same time, how do we tell if a study is making false claims? The answer is, or should be, peer-review of scientific papers and proposals. A panel of other scientists reads any submitted paper or grant proposal, reviews the science for validity -- and recommends acceptance or rejection. At least, that's the way it is supposed to work, but the process of peer-review has its faults.

The problems of peer-review

[Disclaimer: I am a "peer-reviewer." I have over 100 scientific primary publications to my credit, and have been asked to review scientific articles since 1989 and NIH/NSF grant applications since 1998. Thus my opinion is shaped by 25 years in science participating in, and at the mercy of, peer review; and my typical workload consists of requests to review about 12-18 papers and 5-10 grants per year. While this is what some in my field might consider extensive experience with peer-review, it is also fairly limited in that it is only within my field, and only with respect to research papers and grants. By the way, my field required me to be the victim (excuse me – the recipient) of peer-review for many years (7 in my case) before becoming a reviewer.]

What drives doubts about the effectiveness of peer-review? Here are some examples:

While I do acknowledge that there are some merits to the points addressed above, I don't believe that peer review is broken per se, but I do agree that the scientific community as a whole needs to police it better. Violation of public trust by manipulating the system of peer review is an egregious act. Sadly it is not unusual for a "good-old-boy" network to operate in science. First there is the very nature of finding the peers to review the paper. When a paper is submitted to a journal, the authors provide a list of names of scientists (peers) who have expertise in the field and should be able to judge the work on its scientific merits. There is often a second list of persons known or suspected to be biased by virtue of a conflict of interest. Editors (and funding agencies) are alert to even the appearance of bias either for or against the authors, but it is often the case that an editor is unfamiliar with the details of the research and must rely on those recommendations to choose the peer reviewer. Over the past year I have become an editor of a journal in my field. It is hard to find enough reviewers willing to take time out of their research to review papers. Good reviewers get heavy workloads and many requests simply because they are so good (and available). Fortunately, most scientists are aware of the appearance of bias, and will be more critical of their friends than of a complete stranger. I, for one, try to ensure that someone I know professionally does not get a "pass" on sloppy science, since it also reflects poorly on me. In addition, recommending only "friends" as reviewers won't work -- as an editor, I soon discovered that only one in ten of the recommended reviewers will accept an assignment, but that those turning it down will recommend someone else; thus, editors work down the list until they find two-to-five reviewers (depending on the journal).

One of the problems in peer review is the "not invented here" syndrome. An article may be very well-written, but rejected by multiple journals on the basis of "not appropriate (or too complicated) for the readership of this journal." When reviewed by scientists with traditional training within a field, such a paper may be subject to highly critical reviews or unreasonable demands for additional experimentation or controls. When that same paper is read by cross-disciplinary scientists, it may receive a much more favorable (or even enthusiastic) reception. When added to the desire to get a novel finding into print first and lay claim to a result (thus upsetting settled science) it can be very frustrating to know that an outside audience would publish in a heartbeat, while still not getting recognition from one's peers!

This, however, is where the second and fourth bullet points above interact. A new model of publishing embodied by the journal PLoS One (Public Library of Science) is an online publication that does not make value judgments on the appropriateness of an article, but will subject it to an open review by 2-5 peers whose names appear with the publication. By contrast, typical review is by 2-3 outside reviewers, plus the editor, and is "blind" in that the author never knows who reviewed the paper. The philosophy of PLoS One is to let the scientific community sort it all out post-publication; with unlimited space, there can be publication of every article that passes basic peer-review; however, the scientific community will decide for itself what is worth keeping. This is not an entirely bad approach, but it still has problems: (a) the supply of reviewers is limited (see above), and (b) once released, there is no good way to retract a publication later determined to be invalid. Thus, publishing more while maintaining the peer-review process is not necessarily a winning game. What if there were a way to reduce the burden of peer review by simply publishing and letting "society" decide what results are worthwhile? If all scientific publishing were done on the internet, and anyone wanting to find a particular result just had to search for it, there would be the issue of deciding which search results to choose: the most recent, or the one with the most links or comments? Simple comment count would also not be enough, since those comments could entail a running argument of the pros and cons of the scientific paper. If we institute a judgment of worth or a vote on the acceptability of a scientific paper, we risk turning Science into a popularity contest. Consider also the Wikipedia model: Should just anybody -- with or without formal scientific training -- be able to edit our "WikiScience?"

By far, my strongest counter to claims that peer review is broken and should be replaced (or scrapped) is that if there are no gatekeepers, then there is no way to weed out the junk science. The continuing measles/autism scandal is the perfect example of science by public acclaim; if it had been subject to greater scrutiny, it may never have been published. In addition, once it was published, it has been damnably difficult to remove its credibility from those who choose to believe. Do I think peer-review is broken? I certainly think it has been warped, which is good for challenging any notion of settled science, but is simultaneously dangerous in allowing science to be subject to public whim.

It's a process, not a conclusion

I do not think that any portion of the scientific process should be scrapped: from hypothesis generation, to statistical analysis, to peer review. I do think it needs better watchdogs -- and those watchdogs are the scientists whose job it is to always keep in mind that their job is to continually renew the process of science, and never "settle" (pun intended) for the easy answer or the sloppy science. If a scientist witnesses abuse of the system, they should be able to speak out and not get shut out because of political whim. When they find truly novel results, or results that contradict the settled science, they need to be encouraged to publish the novelty, correct their mistakes, and avoid the trap of thinking that a result is a conclusion.

At the same time, the public needs to be better educated so that they do not get told what to do by manipulative media, politicians and, yes, scientists. I would be all for fully open access to science if the public were educated enough to understand the basics to be able to tell what is and is not good science. Unfortunately the reality is that there exists a high level in science where only a very few people worldwide understand or even care. Only time can judge the worth of such research; the rest requires an educated populace. As long as there is *any* stratification within the populace based on education, there will be those who must translate science to the masses, and thereby become gatekeepers.

Unfortunately, the gatekeeper position can all too easily be corrupted, as we have seen. Any scientific conclusion which agrees with the gatekeepers is too easily labeled as a "consensus," while dissenting opinions are labeled as "fringe," "deniers," or even "fraud." The section on scientific blunders in the beginning of this essay certainly highlights the error inherent when new evidence and scientific results come along and relegate the former consensus position to the same historic scrap heap as geocentrism. I will sometimes state that any two scientists will produce three different scientific opinions. In even the narrowest aspects of my research field -- with possibly a total of only 200 labs in the world which study the same aspect of Neuroscience -- it is difficult to get even half of them to agree on any one theory. A true consensus in the sense of agreement of >90% of scientists in that field would require so many coincidences as to be mathematically extremely rare.

Internet memes and the love of science

As stated above, there is a real need for better public education in science. In fact, an old friend of mine just went to work for the U.S. Department of Education in a program working to improve the Science, Technology, Engineering and Mathematics (STEM) curriculum in schools throughout the country. It is a daunting job, but frankly, it is not helped by the tendency for people to latch onto internet memes such as the Facebook website with the non-PG13 name: "I F---ing Love Science." Unfortunately, this site and others like it do more harm than good to the notion of real science (not to mention perpetuating the false notions of settled science and consensus). The IFLS site and the pictures they post are often the flashy, colorful end result, and are more indicative of the skill of the graphic artist than the actual science. They generally ignore the need for a deeper understanding of the Scientific Method and the sheer mind-numbing tedium of experimental testing in order to truly "love" science. While a well-coifed scientific pundit in a tweed jacket (or an engineer in rumpled lab coat) is lauded by the media, working scientists are often ignored or distrusted. Very few scientists have publicists and make-up artists; science is not more "true" because the experiments bubble menacingly, flash lights on complex equipment, or turn pretty colors. Someone stated in an online discussion that the IFLS memes don't truly love science, they are merely "admiring its butt as it walks by."

I can only hope that this exploration of why science is always changing, and always refining itself, will lead to a stronger, better educated public, resistant to error and fraud. Our greatest defense against being fooled or misled -- by changing theories in science, by misinterpretation or by pseudoscience -- lies with education. A true love of science and a better understanding of the dangers of thinking that science is settled or represents a consensus starts with knowledge.

Knowledge is power. Be powerful.

* * *

Copyright © 2014 Tedd Roberts

Tedd Roberts is the pseudonym of neuroscience researcher Robert E. Hampson, Ph.D., whose cutting edge research includes work on a "Neural Prosthetic" to restore memory function following brain injury. His interest in public education and brain awareness has led him to the goal of writing accurate, yet enjoyable brain science via blogging, short fiction, and nonfiction/science articles for the SF/F community. Tedd Roberts' other nonfiction articles for Baen Books are available in the Baen Free Library at http://www.baenebooks.com/c-1-free-library.aspx, Free Nonfiction 2012, 2013, 2014.