Subtle, hidden moderators

Many Labs 2 has reopened a discussion about subtle moderators and, well, their relative non-existence.

Here’s the same ‘there are no subtle, hidden moderators’ discovery from the 1980s.

(An excerpt from my 2005 PhD dissertation. I’m not so confident about the claim that “the effect sizes in most studies were not inconsistent; the apparent inconsistency was only in the statistical significance of studies.” There may have been differences in ES magnitude; I haven’t gone back to check. But I do believe it is true that the vast majority were in the same direction.)

The theory of situation-specific validity in employment tests

In 1996 Frank Schmidt wrote “reliance on statistical significance testing…has systematically retarded the growth of cumulative knowledge in psychology” (p. 115). As evidence he offered a series of meta-analyses he, Jack Hunter and others had done throughout the 1970s and early 1980s. Schmidt, Hunter and their collaborators explained that they had originally set out to “empirically test one of the orthodox doctrines of personnel psychology: the belief in the situational specificity of employment tests validities” (Pearlman, Schmidt & Hunter, 1980, p. 373).

In this case, ‘employment tests’ refers to professionally developed cognitive ability and aptitude tests designed to predict job performance. And the “orthodox doctrine” is the theory of situational specificity, which held that the correlation between test score and job performance did not have general validity. “A test valid for a job in one organization or setting may be invalid for the same job in another organization or setting” (Schmidt & Hunter, 1981, p. 1132).

The theory also held that this was the case even when jobs appeared to be superficially very similar. For (a fictional) example, a test might be a good predictor of job performance for an information service operator at a telecommunications company in Melbourne, but not for the same company in Sydney. This might seem strange at first, but it is not implausible: subtle differences in the clientele, training structure or supervisory style could plausibly have an impact on job performance. A difficult supervisor at one branch might require successful staff at that branch to have better developed conflict resolution skills, and so on. Any such subtle differences could seriously challenge a predictive test’s claim to general validity. So the theory held that the validity of the tests depended on more than just the listed tasks in a given position description; it depended on the cognitive information processing and problem solving demands of the workplace.

Where did this theory of situational specificity come from? How was it motivated? The theory of situational specificity grew out of the empirical ‘fact’ that considerable variability was observed from study to study, even when the jobs and/or tests were very similar. The theory was empirically driven; its purpose was to explain the variability, or inconsistency, of empirical results. But the effect sizes in most studies were not inconsistent; the apparent inconsistency was only in the statistical significance of studies.

For instance, imagine study 1 found a particular test to be a statistically significant predictor of job performance at location A, while study 2 found the same test was not a statistically significant predictor at location B. The purpose of the theory of situational specificity was to explain this inconsistency in the statistical significance of empirical results by generating potential moderating variables. One obvious factor that could also explain why one study found a statistically significant result and another did not is the relative statistical power of the studies. But this went unnoticed for several decades. The theory of situational specificity grew structurally complex, with the addition of many potential moderating variables. In fact, the search for such moderating variables became the main business of industrial and organisational psychology for decades, despite the fact that the variability the theory had been developed to explain was illusory. In their meta-analyses, Hunter, Schmidt and their colleagues demonstrated that the differences in allegedly inconsistent results could be exclusively accounted for by the relative statistical power of the studies. The reporting of individual results as ‘significant’ or ‘non-significant’ had created the illusion of inconsistency, even though almost all obtained effect sizes were in the same direction.

“…if the true validity for a given test is constant at .45 in a series of jobs…and if sample size is 68 (the median over 406 published validity studies…) then the test will be reported to be valid 54% of the time and invalid 46% of the time (two tailed test, p=.05). This is the kind of variability that was the basis for the theory of situation-specific validity” (Schmidt & Hunter, 1981, p. 1132).
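This kind of illusory inconsistency is easy to reproduce in simulation. The sketch below draws many studies of n = 68 from a single population with one fixed correlation and counts how often each study is declared ‘valid’ (statistically significant, two-tailed). The population correlation of 0.25 is my illustrative assumption, not Schmidt and Hunter’s figure: the true validity of .45 in the quote refers to a corrected value, and the correlation observed in any single study is attenuated by criterion unreliability and range restriction, so a smaller observed value is used here for illustration.

```python
import numpy as np

# Illustrative assumption: the observed validity in any single study is
# rho = 0.25 (a hypothetical value chosen for this sketch; the .45 in the
# quote is a corrected 'true' validity, attenuated in observed data).
rho, n, n_studies = 0.25, 68, 10_000
cov = [[1.0, rho], [rho, 1.0]]

rng = np.random.default_rng(42)
significant = 0
for _ in range(n_studies):
    # Draw one 'validity study': n applicants with test score and job
    # performance from a bivariate normal with correlation rho.
    xy = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    r = np.corrcoef(xy[:, 0], xy[:, 1])[0, 1]
    # t statistic for a sample correlation; |t| > ~2.0 approximates the
    # two-tailed p < .05 criterion at df = n - 2 = 66.
    t = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)
    if abs(t) > 2.0:
        significant += 1

frac = significant / n_studies
print(f"Test reported 'valid' in {frac:.0%} of studies")
```

Even though every simulated study samples exactly the same population effect, roughly half of them reach significance and half do not. A narrative reviewer tallying ‘valid’ versus ‘invalid’ studies would see exactly the inconsistency that the theory of situational specificity was invented to explain.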

How long did organisational psychology pursue this misdirected theory and its associated research program? In 1981, towards the end of their meta-analysis series, Hunter and Schmidt wrote: “the real meaning of 70 years of cumulative research on employment testing was not apparent [until now]” (p.1134).

Of the use of NHST in this program they wrote: “The use of significance tests within individual studies only clouded discussion because narrative reviewers falsely believed that significance tests could be relied on to give correct decisions about single studies” (p.1134).

The case of the Theory of Situation-Specific Validity provides at least some evidence that NHST, as it is typically used, with little regard for statistical power and an over-reliance on dichotomous decisions, can damage the progress of science; it can lead a research program badly astray. Whether or not the theory itself is actually true is irrelevant to the argument here. The ‘damage’ is that years of empirical data were seen to support the theory, when in fact they did not.

Hunter and Schmidt also hint at another, perhaps more disturbing, level of damage: “Tests have been used in making employment decisions in the United States for over 50 years… In the middle and late 1960s certain theories about aptitude and ability tests formed the basis for most discussion of employee selection issues, and in part, the basis for practice in personnel psychology… We now have… evidence… that the earlier theories were false.” (1981, p.1128-9).

Schmidt and Hunter (1998) provide more detail about “practical and theoretical implications of 85 years of research findings” (p. 262).




Interviews 1: Paul Meehl (2002)

During my (relatively ancient*) PhD, I interviewed many prominent (at the time) critics of Null Hypothesis Significance Testing and advocates of methodological and statistical change in psychology, medicine and ecology. My interview with Paul Meehl took place in 2002. (He died in Feb 2003, which makes me feel extremely lucky and grateful, to both him and his wife, Leslie Yonce.) This is not the entire transcript, but bits that I’ve been thinking about while at SIPS 2017. Over coming months, I’m going to make an effort to post bits of other interview transcripts here. (Shockingly, my recordings are on cassette tapes.)

On falsification

Everyone thinks I’m a Popperian; I want people to know I’m not. I’ve tried to explain this… (personal communication, August 2002)

On the persistence of NHST and inertia

…plain psychic inertia is a powerful factor in science, as it is in other areas of life—don’t underestimate it.  When the issue is method, rather than substance, it makes it worse.  If one has been thinking in a certain way since he was a senior in college ‘the way you test theories in psychology is refute H null’, there is a certain intellectual violence involved in telling a person, well, not that they’ve been a crook, but that they’ve been deceiving themselves (personal communication, August 2002).

On the 1999 APA Task Force on Statistical Inference report

I’ll tell you a story that might interest you.  It is a sad commentary on our profession.

The Task Force appointed four outside consultants:  Cronbach, Tukey, Mosteller and Meehl.  In my letter of acceptance [to be a consultant] I wrote about NHST and the difference between a substantive theory and a statistical hypothesis.  I said the ‘logical problem of inductive inference is bigger than the mathematical problems being debated, like how you best compute the power for example.’

The first draft [of the TFSI report] had nothing in it of what I had said.

So, I wrote another note, and reminded them of my first note.  I said ‘if what I had to say on this question is all baloney seems to me you might want to tell me what is the matter with it.’

There was no response.  The second draft had nothing [related to my comments].

Then, the quasi-final draft arrived, still no reference to anything I had said.

Finally, in my last letter, I was slightly irritated—I don’t have a real fragile ego so I wasn’t enraged, but I was hurt—I asked ‘I wonder why you appointed expert, outside consultants, if you won’t pay any attention to their input.’

Still no response!  It is somewhat discourteous:  You appoint somebody as an outside advisor and they put in the work.  I don’t even know whether the chairman of the committee even circulated my stuff.

When I read the final report, most of the things were very obvious and trivial and should have been in there.  For example, tell [the reader] whether you’ve got this population and tell whether people dropped out.  Of course, I agree with all that.  But on the hardest part of it, the whole problem of inductive inference in this context, what your general view of theory testing is, the philosophical aspects—they were practically missing.  You would think that philosophers of science didn’t exist!  (personal communication, August, 2002)

* 2006 is not really that long ago. But at SIPS it feels like it. Also, I’d now argue quite violently against many of the claims I made in my dissertation, and against the language I used to make them. It really has been a *very* long decade.


Improving and evaluating reasoning

In January 2017, we commenced work on the SWARM project, which is the University of Melbourne’s team in the larger CREATE program (Crowdsourcing Evidence, Argumentation, Thinking and Evaluation). CREATE aims to find ways to improve reasoning by taking advantage of (a) the wisdom of crowds, and (b) structured analytical techniques, implemented in cloud-based systems.

Four research teams have been selected to participate in a number of rounds, in which the systems they create will be tested by an independent evaluation group. The analyses produced by test crowds using the systems will be evaluated on a number of dimensions:

  • Do they make correct or accurate judgements?

  • How rigorous and comprehensive is their analysis?

  • How clearly is the analysis communicated?

  • How user-friendly is the system?

You can read more about the project or listen to some interviews about it. If you’d like to stay informed about the project, sign up here.

SWARM is led by Prof Mark Burgman, Tim Van Gelder, Fiona Fidler and Richard de Rozario.


New publication: Meta-research for evaluating reproducibility in ecology and evolution

Over the last few years we have learned a lot about the reliability of scientific evidence in a range of fields through large scale ‘meta-research’ projects. Such projects take a scientific approach to studying science itself, and can help shed light on whether science is progressing in the cumulative fashion we might hope for.

One well-known meta-research example is The Reproducibility Project in Psychology. A group of 270 psychological scientists embarked on a worldwide collaboration to undertake full direct replications of 100 published studies, in order to test the average reliability of findings. Results showed over half of those 100 replications failed to produce the same results as the original. Similar studies have been conducted in other fields too, such as biomedicine and economics, with equally disappointing results.

It’s tempting to think that this kind of replication happens all the time. But it doesn’t. Studies of other disciplines tell us that only 1 in every 1,000 papers published is a true direct replication of previous research. The vast majority of published findings never face the challenge of replication.

As yet, there have not been any meta-research projects in ecology and evolution, so we don’t know whether the same low reproducibility rates plague our own discipline. In fact, it’s not just that the meta-research hasn’t been done yet, it is quite unlikely to ever happen, at least in the form of direct replication discussed above. This is because the spatial and temporal dependencies of ecological processes, the long time frames and other intrinsic features make direct replication attempts difficult at best, and often impossible.

But there are real reasons to be concerned about what that meta-research would show, if it were possible. The aspects of scientific culture and practice that have been identified as direct causes of the reproducibility crisis in other disciplines exist in ecology and evolution too. For example, there’s a strong bias towards publishing only novel, original research, which automatically pushes replication studies out of the publication cycle. The pragmatic difficulties of experimental and field research mean that the statistical power of those studies is often low, and yet there is a disproportionate number of ‘positive’ or ‘significant’ studies in the literature—another kind of publication bias towards ‘significant’ results. The rate of full data and material sharing in many journals is still low, despite this being one of the easiest and most obvious solutions to reproducibility problems.

In our paper, we argue that the pragmatic difficulties with direct replication projects shouldn’t scare ecologists and evolutionary biologists off the idea of meta-research projects altogether. We discuss other approaches that could be used for replicating ecological research. We also propose several specific projects that could serve as ‘proxies’ or indicator measures of the likely reproducibility of the ecological evidence base. Finally, we argue that it’s particularly important for the discipline to take measures to safeguard against the known causes of reproducibility problems, in order to maintain public confidence in the discipline and the evidence base it provides for important environmental and conservation decisions.

Paper citation:

Fidler, F., Chee, Y.E., Wintle, B.C., Burgman, M.A., McCarthy, M.A., & Gordon, A. (2017). Meta-research for evaluating reproducibility in ecology and evolution. BioScience. doi: 10.1093/biosci/biw159


The role of scientists in public debate

Monday Feb 6 2017, 9am – 5pm, Storey Hall, RMIT University

A one-day workshop for graduate students and early to mid career scientists in conservation and environmental research areas, who are interested in public engagement for practical and/or philosophical reasons. RSVP:

What are the bounds of being a scientist, and how will I know if I overstep them? Is advocacy at odds with being a good scientist? What is the public’s perception of scientists, and how do they react to scientists who break the ‘honest broker’ model of engagement? Do we simply need more knowledge brokers and NGOs—is it unreasonable to expect scientists to be involved in public debate, as well as their day job? How is objectivity maintained in science, if scientists are people with values? 

We’re here to help with these questions! Dr Kristian Camilleri (History and Philosophy of Science, HPS); Associate Professor Fiona Fidler (BioScience|HPS); Dr Darrin Durant (HPS); The HPS Postgraduate Society; Dr Jenny Martin (BioScience); Dr Georgia Garrard (RMIT, Interdisciplinary Conservation Science); Associate Professor Sarah Bekessy (RMIT, Interdisciplinary Conservation Science). We’ll also have a panel of media experts to take questions on the day.

Public engagement is something strongly encouraged by most universities, and there are many existing resources for effective science communication. However, most focus on expert information provision, where a scientist has some new knowledge that they wish to communicate to the public. Engagement advice typically focuses on news-style science communication; it less often deals with other forms of engagement, such as entering public debates or speaking out for or against new policy proposals. In those cases, the advice scientists receive often amounts to ‘separate the facts from your own personal values’ and ‘don’t speak outside your direct domain of expertise’. In practice, most scientists don’t know how to interpret that advice, or they implicitly understand that it is impossible to follow. Underdeveloped guidelines, sometimes coupled with warnings from colleagues who have had bad prior experiences, can be enough to make scientists withdraw from public engagement. We’d like to talk about that…

In this workshop we have two main goals. First, we want to find out from scientists, in their own words, what dilemmas they encounter when contemplating engagement. Do scientists worry about their scientific credibility, in the eyes of their peers, the public, or both, if they take a position in public debate on policy issues? Is it beyond the scope of their role as scientists to do this? These are thorny issues that we’ll tackle in a focus-group-style discussion (structured elicitation exercise) in the first session of the workshop.

Second, we aim to connect scientists with relevant expertise in the philosophy and sociology of science, to help unpack some of the deeper conceptual issues underlying those dilemmas. We will explore questions like: How is objectivity maintained in science, if scientists are people with values? What is the public’s perception of scientists, and how do they react to scientists who break the ‘honest broker’ or ‘information provision only’ model of engagement? After exploring these questions in the workshop, we will also discuss how to set up longer term peer-to-peer networks and online resources that can take our workshop discussions to a broader audience.

Workshop program

9am                Intro (Fiona, Sarah)

9:15am           Background to our interest in engagement (Georgia)

9:30am           What are the dilemmas scientists face when contemplating engaging in debate and/or policy advocacy? Semi-structured elicitation exercise. (Fiona, Georgia, Sarah)


10:30am        Legitimate values in science, objectivity and the value-free ideal. Seminar, with Q&A. (Kristian)

11:30am        Public perceptions: what does the public expect of experts? Seminar, with Q&A. (Darrin)

12:30pm        LUNCH

1:15pm          Media interactions. Panel discussion with media experts.

2pm                Follow-up session on this morning’s elicitation exercise. What issues remain outstanding? What haven’t we addressed in our previous sessions? How else can philosophy and sociology of science help with these dilemmas? Discussion. (All)

3pm                Philosophy of Science engagement network building. Discussion. (HPS postgrads)


3:30pm          Science engagement support (Jenny)

4pm                Workshop evaluation (Fiona)

Please contact Fiona Fidler for more information.


This workshop is supported by a University of Melbourne Engagement Grant.
