Dawn of P-Hacking
Simmons lost touch with Cuddy, who was by then teaching at Northwestern. He remained close to Nelson, who had befriended Uri Simonsohn, a behavioral scientist and a fellow skeptic. Nelson and Simonsohn kept up an email correspondence for years. They, along with Simmons, took particular umbrage when a prestigious journal accepted a paper by Daryl Bem, an emeritus professor of psychology at Cornell, who claimed that he had strong evidence for the existence of extrasensory perception. The paper struck them as the ultimate in bad-faith science. "How can something cause something else before it has even happened?" Nelson says. "Oh, you reverse time, then it can." And yet the methodology was supposedly sound. After years of debating among themselves, the three of them resolved to figure out how so many researchers were coming up with such unlikely results.
Over the course of several months of conference calls and computer simulations, the three researchers determined that the enemy of science -- subjectivity -- had burrowed its way into the field's methodology more deeply than had been recognized. Typically, when researchers analyzed data, they were free to make various judgment calls about which data to keep: whether, for example, to include or exclude experimental subjects whose results were unusual, or whether to add subjects to the sample or drop others because of some experimental glitch. More often than not, those decisions -- always seemingly justified as a way of eliminating noise -- conveniently strengthened the findings. The field (hardly unique in this regard) had accepted that kind of tinkering for years, underappreciating just how powerfully it skewed results in favor of false positives, particularly when two or three such analytic choices were in play at the same time. The three eventually wrote about this phenomenon in a paper called "False-Positive Psychology," published in 2011. "Everyone knew it was wrong, but they thought it was wrong the way it's wrong to jaywalk," Simmons recently wrote in a paper taking stock of the field. "We decided to write 'False-Positive Psychology' when simulations revealed it was wrong the way it's wrong to rob a bank."
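The mechanism is easy to see in miniature. The sketch below is not the authors' simulation code; it is an illustrative Python rerun of the same idea, in which an imaginary researcher can report either of two outcome measures or their average, and can collect ten more subjects per group if nothing has yet come out significant. The sample sizes and the particular liberties are assumptions chosen for the example.

    # Toy simulation: no true effect anywhere, but a "researcher" with a few
    # flexible analysis options still finds "significance" far more often
    # than 5 percent of the time.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)

    def significant(a, b):
        """True if a two-sample t-test between samples a and b gives p < .05."""
        return stats.ttest_ind(a, b).pvalue < 0.05

    def one_flexible_experiment(n=20):
        # Two groups, two outcome measures, all pure noise.
        g1_m1, g1_m2 = rng.normal(size=n), rng.normal(size=n)
        g2_m1, g2_m2 = rng.normal(size=n), rng.normal(size=n)

        candidates = [
            (g1_m1, g2_m1),                              # report measure 1...
            (g1_m2, g2_m2),                              # ...or measure 2...
            ((g1_m1 + g1_m2) / 2, (g2_m1 + g2_m2) / 2),  # ...or their average
        ]
        if any(significant(a, b) for a, b in candidates):
            return True

        # Nothing yet? Run ten more subjects per group and test measure 1 again.
        return significant(np.concatenate([g1_m1, rng.normal(size=10)]),
                           np.concatenate([g2_m1, rng.normal(size=10)]))

    runs = 10_000
    rate = sum(one_flexible_experiment() for _ in range(runs)) / runs
    print(f"Nominal false-positive rate: 5%; rate with flexibility: {rate:.0%}")

Even with only those few liberties, the share of "significant" results produced by pure noise climbs well past the advertised 5 percent.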
Simmons called those questionable research practices P-hacking, because researchers used them to lower a crucial measure of statistical significance known as the P-value. The P stands for probability, as in: How probable is it that researchers would happen to get the results they did -- or even more extreme ones -- if there were, in truth, no phenomenon to observe? (And no systematic error.) For decades, the standard for so-called statistical significance -- and the hurdle a study must clear to be considered publishable -- has been a P-value of less than 5 percent.
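Concretely, the calculation and the threshold look something like this; the numbers are invented, and the two-sample t-test here is just one of many tests a P-value can come from.

    # Invented measurements for two groups of subjects.
    from scipy import stats

    treatment = [5.1, 6.3, 4.8, 7.0, 5.9, 6.4, 5.5, 6.8]
    control   = [4.2, 5.0, 4.6, 5.3, 4.1, 4.9, 5.2, 4.4]

    result = stats.ttest_ind(treatment, control)
    # The P-value answers: if there were no real difference between the two
    # populations, how often would a gap at least this large appear by chance?
    print(f"P-value: {result.pvalue:.3f}")
    print("statistically significant" if result.pvalue < 0.05
          else "not statistically significant")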
To examine how easily the science could be manipulated, Simmons and Simonsohn ran a study in which they asked 20 participants their ages (and their fathers' birthdays). Half the group listened to the Beatles song "When I'm Sixty-Four"; the other half listened to a control (the instrumental track "Kalimba"). Using methodology that was entirely standard in the field, they were able to show that the participants who listened to the Beatles song were magically a year and a half younger than those who had heard the control music. The heading of the section in which they explained the stunt: "How Bad Can It Be? A Demonstration of Chronological Rejuvenation." It was witty, it was relatable -- everyone understood that it was a critique of the fundamental soundness of the field.
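A hedged sketch of what such an analysis looks like appears below, with invented numbers; it assumes, as the published demonstration did, that participants' reported ages were compared across the two song conditions while statistically adjusting for their fathers' ages, and every variable name here is mine.

    # Invented data standing in for the 20 participants; only the shape of the
    # analysis matters here, not the numbers.
    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.DataFrame({
        "reported_age": [22, 25, 31, 19, 27, 24, 29, 21, 26, 23,
                         24, 28, 33, 20, 29, 26, 31, 23, 28, 25],
        "father_age":   [51, 55, 60, 47, 57, 53, 59, 50, 56, 52,
                         52, 57, 61, 48, 58, 54, 60, 51, 57, 53],
        # 1 = heard "When I'm Sixty-Four", 0 = heard "Kalimba"
        "heard_sixty_four": [1] * 10 + [0] * 10,
    })

    # The coefficient on heard_sixty_four estimates how much "younger" the
    # Beatles group is after adjusting for fathers' ages.
    model = smf.ols("reported_age ~ heard_sixty_four + father_age", data=df).fit()
    print(model.params["heard_sixty_four"], model.pvalues["heard_sixty_four"])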
"We realized entire literatures could be false positives," Simmons says. They had collaborated with enough other researchers to recognize that the practice was widespread and counted themselves among the guilty. "I P-hacked like crazy all through my time at Princeton, and I still couldn't get interesting results," Simmons says.
The paper generated its fair share of attention, but it was not until January 2012, at a tense conference of the Society for Personality and Social Psychology in San Diego, that social psychologists began to glimpse the iceberg looming ahead -- the sliding furniture, the recriminations, the crises of conscience and finger-pointing and side-taking that would follow. At the conference, several hundred academics crowded into a room to hear Simmons and his colleagues challenge the methodology of their field. First, Leslie John, then a graduate student and now an associate professor at Harvard Business School, presented a survey of 2,000 social psychologists suggesting that P-hacking, along with other questionable research practices, was common. In his presentation, Simonsohn introduced a new concept: a graph, built from the P-values of a body of studies, that could be used to evaluate that research (the lower the P-values overall, the better). He called it a P-curve and suggested that it could be used, for example, to evaluate the research that a prospective job candidate submitted. To some, the implication of the combined presentations seemed clear: The field was rotten with the practice, and egregious P-hackers should not get away with it.
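The idea is simple enough to sketch: collect the statistically significant P-values (those under .05) that a body of studies reports and look at how they are distributed, since genuine effects tend to pile up near zero while P-hacked null effects tend to bunch just under .05. The few lines below are a bare-bones illustration of that tabulation, with invented values and an arbitrary bin width, not Simonsohn's actual implementation.

    # Tabulate a p-curve: keep only the "significant" P-values (below .05)
    # and count how many fall in each .01-wide bin.
    def p_curve(p_values, alpha=0.05, width=0.01):
        edges = [round(i * width, 2) for i in range(int(round(alpha / width)) + 1)]
        return {f"{lo:.2f}-{hi:.2f}": sum(1 for p in p_values if lo <= p < hi)
                for lo, hi in zip(edges, edges[1:])}

    # Hypothetical P-values harvested from one body of published studies.
    reported = [0.049, 0.041, 0.032, 0.048, 0.044, 0.012, 0.038, 0.046, 0.21, 0.07]
    for label, count in p_curve(reported).items():
        print(label, "#" * count)
    # These counts bunch just below .05, the signature the P-curve's authors
    # associate with P-hacking, rather than piling up near zero as genuine
    # effects tend to.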
Epilogue:
In 2014, Psychological Science started awarding electronic badges, an extra seal of approval, to studies that made their data and methods publicly available and to studies that preregistered their design and analysis ahead of time, so that researchers could not quietly fish around for a new hypothesis after turning up unexpected findings.