Many Labs 2 has reopened a discussion about subtle moderators, and well, their relative non-existence.
Oh, and let’s lay to rest the subtle moderators idea. If you don’t get the effect in Tulsa when it was originally found in Topeka it is probably because the effect was ephemeral, not because you wore the wrong color lab coat. https://t.co/g41kd7TY7Y
— Brent W. Roberts (@BrentWRoberts) November 19, 2018
Here’s the same ‘there are no subtle, hidden moderators’ discovery from the 1980s.
(An excerpt from my 2005 PhD dissertation. I’m not so confident about the claim that “the effect sizes in most studies were not inconsistent; the apparent inconsistency was only in the statistical significance of studies.” There may have been differences in ES magnitude (I haven’t gone back to check), but I do believe it is true that the vast majority were in the same direction.)
The theory of situation-specific validity in employment tests
In 1996 Frank Schmidt wrote “reliance on statistical significance testing…has systematically retarded the growth of cumulative knowledge in psychology” (p.115). As evidence he offered a series of meta-analyses he, Jack Hunter and others had done throughout the 1970s and early 1980s. Schmidt, Hunter and their collaborators explained that they had originally set out to “empirically test one of the orthodox doctrines of personnel psychology: the belief in the situational specificity of employment test validities” (Pearlman, Schmidt & Hunter, 1980, p. 373).
In this case, the employment tests refer to professionally developed cognitive ability and aptitude tests designed to predict job performance. And the “orthodox doctrine” is the theory of situational specificity, which held that the correlation between test score and job performance did not have general validity. “A test valid for a job in one organization or setting may be invalid for the same job in another organization or setting” (Schmidt & Hunter, 1981, p.1132).
The theory also held that this was the case even when jobs appeared to be superficially very similar. For (a fictional) example, a test might be a good predictor of job performance for an information service operator at a telecommunications company in Melbourne, but not for the same company in Sydney. This might seem strange at first, but it is not implausible: subtle differences in the clientele, training structure or supervisory style could affect job performance. A difficult supervisor at one branch might require successful staff at that branch to have better-developed conflict resolution skills, and so on. Any such subtle differences could seriously challenge a predictive test’s claim to general validity. So the theory held that the validity of the tests depended on more than just the listed tasks in a given position description—it depended on the cognitive information processing and problem solving demands of the workplace.
Where did this theory of situational specificity come from? How was it motivated? The theory of situational specificity grew out of the empirical ‘fact’ that considerable variability was observed from study to study, even when the jobs and/or tests were very similar. The theory was empirically driven; its purpose was to explain the variability, or inconsistency, of empirical results. But the effect sizes in most studies were not inconsistent; the apparent inconsistency was only in the statistical significance of studies.
For instance, imagine study 1 found a particular test to be a statistically significant predictor of job performance at location A; in contrast, study 2 found the same test was not a statistically significant predictor of job performance at location B. The purpose of the theory of situational specificity was to explain the inconsistency in the statistical significance of empirical results, by generating potential moderating variables. One obvious factor that could explain why one study found a statistically significant result and another did not is the relative statistical power of the studies. But this went unnoticed for several decades. The theory of situational specificity grew structurally complex, with the addition of many potential moderating variables. In fact, the search for such moderating variables became the main business of industrial and organisational psychology for decades, despite the fact that the variability the theory had been developed to explain was illusory. In their meta-analyses Hunter, Schmidt and their colleagues demonstrated that the difference in allegedly inconsistent results could be accounted for entirely by the relative statistical power of the studies. The reporting of individual results as ‘significant’ or ‘non-significant’ had created the illusion of inconsistency, even though almost all obtained effect sizes were in the same direction.
“…if the true validity for a given test is constant at .45 in a series of jobs…and if sample size is 68 (the median over 406 published validity studies…) then the test will be reported to be valid 54% of the time and invalid 46% of the time (two tailed test, p=.05). This is the kind of variability that was the basis for theory of situation-specific validity” (Schmidt & Hunter, 1981, p. 1132).
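The variability Schmidt and Hunter describe is easy to reproduce by simulation. The sketch below is an illustration, not their actual procedure: it assumes an observed-level correlation of .25 per study (a hypothetical value; true validities like .45 are attenuated in observed data by criterion unreliability and range restriction, and .25 is chosen so that roughly half of n = 68 studies reach p < .05). It shows significance flipping almost at random across studies while nearly every study’s effect size falls in the same direction.

```python
import math
import random

random.seed(1)

N_STUDIES = 10_000  # number of simulated validity studies
N = 68              # median sample size per study (Schmidt & Hunter, 1981)
RHO = 0.25          # assumed observed-level correlation (illustrative)
T_CRIT = 1.997      # two-tailed .05 critical t value for df = 66

def simulate_study(n, rho):
    """Draw n (test score, performance) pairs from a bivariate normal
    distribution with correlation rho; return the sample Pearson r."""
    xs, ys = [], []
    for _ in range(n):
        x = random.gauss(0, 1)
        y = rho * x + math.sqrt(1 - rho ** 2) * random.gauss(0, 1)
        xs.append(x)
        ys.append(y)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)

rs = [simulate_study(N, RHO) for _ in range(N_STUDIES)]

# A study "finds the test valid" when its t statistic clears the critical value.
significant = sum(abs(r) * math.sqrt((N - 2) / (1 - r ** 2)) > T_CRIT for r in rs)
positive = sum(r > 0 for r in rs)

print(f"reported valid:  {significant / N_STUDIES:.0%}")
print(f"effect size > 0: {positive / N_STUDIES:.0%}")
```

Under these assumptions roughly half the simulated studies come out “valid” and half “invalid”, even though every study samples from the same population with the same true correlation; almost all sample correlations are positive. That is the illusory study-to-study variability the theory was built to explain.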
How long did organisational psychology pursue this misdirected theory and its associated research program? In 1981, towards the end of their meta-analysis series, Hunter and Schmidt wrote: “the real meaning of 70 years of cumulative research on employment testing was not apparent [until now]” (p.1134).
Of the use of NHST in this program they wrote: “The use of significance tests within individual studies only clouded discussion because narrative reviewers falsely believed that significance tests could be relied on to give correct decisions about single studies” (p.1134).
The case of the Theory of Situation-Specific Validity provides at least some evidence that NHST, as it is typically used—with little regard for statistical power and an over-reliance on dichotomous decisions—can damage the progress of science; it can lead a research program seriously astray. Whether or not the theory itself is actually true is irrelevant to the argument here. The ‘damage’ is that years of empirical data were seen to support the theory, when in fact they did not.
Hunter and Schmidt also hint at another, perhaps more disturbing, level of damage: “Tests have been used in making employment decisions in the United States for over 50 years… In the middle and late 1960s certain theories about aptitude and ability tests formed the basis for most discussion of employee selection issues, and in part, the basis for practice in personnel psychology… We now have… evidence… that the earlier theories were false.” (1981, p.1128-9).
Schmidt and Hunter (1998) provide more detail about “practical and theoretical implications of 85 years of research findings” (p. 262).