Theodore Xenophon Barber, Medfield Foundation, Medfield, Massachusetts
Introduction
Since experiments are designed and carried out by fallible individuals, they have as many pitfalls as other human endeavors. In this text we shall discuss ten pivotal points in research where investigators and experimenters can go astray. By becoming sensitized to these pitfalls, those of us who are engaged in experimental research may be better able to avoid them in our own studies. Also, those of us who utilize research results in our teaching or practice may be able to use experimental studies more wisely if we are sensitized to the many possibilities they contain for misleading results and conclusions.
Two questions will be at the forefront of discussion: 1) At what pivotal points in the complex research process can the experimental study go astray and give rise to misleading results and conclusions? 2) What steps can researchers take to avoid these pitfalls? To answer these questions, I shall first focus on those aspects of experimental studies that are under the control of the investigator and then on those aspects that are under the control of the experimenter. I shall begin by making a distinction between the investigator and the experimenter.
Although the investigator and the experimenter can be the same person, their roles are functionally quite different, and it is rather common in recent research to find one person in the role of investigator and another person in the role of experimenter.
The investigator decides that a study is to be conducted, how it is to be designed and carried out, and how it is to be analyzed and interpreted. Thus, the investigator is responsible for the experimental design, the procurement and training of experimenters, the overall conduct of the study, the analysis of the results, the interpretation of data, and the writing of the final research report.
The experimenter, on the other hand, is the person who conducts the study: the person who tests the subjects, administers the experimental procedures, and observes and records the subjects' responses. Thus, strictly speaking, a person in the role of the experimenter is responsible for the collection of the data but is not responsible for the experimental design, the analysis and interpretation of the data, or the final research report.
In brief, even though the same person may take the role of both an investigator and an experimenter, these two roles are functionally quite different. Furthermore, in much present-day research, investigators are typically highly paid professionals whereas experimenters are often graduate or undergraduate students.
Table 1 Investigator and Experimenter Effects
Investigator Effects
I. Investigator Paradigm Effect
II. Investigator Experimental Design Effect
III. Investigator Loose Procedure Effect
IV. Investigator Data Analysis Effect
V. Investigator Fudging Effect
Experimenter Effects
VI. Experimenter Personal Attributes Effect
VII. Experimenter Failure to Follow the Procedure Effect
VIII. Experimenter Misrecording Effect
IX. Experimenter Fudging Effect
X. Experimenter Unintentional Expectancy Effect
Table 1 lists ten major pitfalls in research that can directly or indirectly give rise to misleading results and conclusions. As shown in the top portion of Table 1, misleading results and conclusions in an experimental study can derive from the investigator's paradigm, from his experimental design, from the "looseness" of his experimental procedure, from his analysis of the data and, possibly, from his fudging of data. As shown in the bottom portion of Table 1, misleading results and conclusions can also be produced by the experimenter's personal attributes, by his failure to follow the experimental procedures, by his misrecording of data, by his fudging of data, and by his expectancies. Each of these effects will be discussed in turn.
Before I turn to the discussion of each of the ten pitfalls listed in Table 1, however, let us note an important point. During recent years, the biasing effects and misleading conclusions that are associated with experimental research have been commonly attributed to the experimenter who carries out the study rather than to the investigator who designs and has the major responsibility for the study. Recent books (Adair, 1973; Friedman, 1967; Jung, 1971; A.G. Miller, 1972; Rosenthal, 1966; Rosenthal & Rosnow, 1969) which discussed the artifacts or pitfalls in research tended to focus on the experimenter and tended to neglect the important role of the investigator. I shall attempt to redress this imbalance by focusing equally on the role of the investigator and the role of the experimenter. I hope that it will become clear to the reader that the bias that has commonly been attributed to the experimenter who runs the study is at times actually due to the investigator who has major responsibility for the study.
NOTES
1. In the early 1960s, only 37 percent of 71 biologists who were interviewed by Crane (1964) reported that they collected all of their own data. I believe this trend has accelerated in both the biological and behavioral sciences and that now most investigators only rarely serve as experimenters.
2. Although this book covers ten pitfalls in behavioral science research, there are many related topics that are not discussed. These relevant problems, which are discussed in general terms in recent texts (Adair, 1973; Jung, 1971; A.G. Miller, 1972; Rosenthal & Rosnow, 1969), and which are covered in detail in the books cited below, include the following: (a) Problems of sampling bias due to the use of volunteer subjects (Rosenthal & Rosnow, 1975). (b) Problems pertaining to inter-subject communication about the experimental procedures which derive from the fact that a substantial proportion of subjects do not keep their promise "not to talk about the experiment to others" (Farrow, Farrow, Lohss, & Taub, 1975; Wuebben, Straits, & Schulman, 1974). (c) Ethical issues pertaining to lack of confidentiality and informed consent, coercion of college students to participate in psychological experiments as part of a course requirement, the use of misleading instructions or deception to influence subjects, and the application of stress to subjects (B. Barber, Lally, Makarushka, & Sullivan, 1973). (d) Problems pertaining to the researcher and the social system (science as a system of norms, the associations of scientists, the role, functions, and social status of scientists, and social factors that affect the formulation of research problems and the publication of research findings) (Sjoberg & Nett, 1968).
In Table 1 the Investigator Paradigm Effect is listed first. This effect exerts a pervasive influence on every aspect of experimental research including the results and conclusions.
Kuhn (1962, 1970) has used the term paradigm to refer to a conceptual framework and a body of assumptions, beliefs, and related methods and techniques that are shared by a large group of scientists at a particular time. For example, in astronomy we can refer to the Copernican (heliocentric) paradigm which differed markedly from and which gradually replaced the Ptolemaic (geocentric) paradigm, and in psychology we can refer to the behavioristic paradigm which included a conceptual framework and a related body of assumptions, beliefs, and methods that were shared by a large group of psychologists until recent years. Kuhn presented historical evidence that such paradigms set boundaries for "normal" scientific research. A paradigm provides an implicit framework for the scientists working in an area. The assumptions or presuppositions of the paradigm govern the choice of problems and the "correct" methods and criteria for evaluating the solution of such selected problems. By defining what is normal, accepted, and natural, the paradigm acts as a blinder. The rules or ways of approaching problems operate more or less automatically and the scientist believes he is doing the natural or obvious thing.
Kuhn also noted, however, that paradigms are useful in providing directions for scientific research; thus they permit intensive and focused investigations. Without an accepted paradigm research would be diffuse and lead to the accumulation of disorganized facts.
Once a paradigm is established, however, the function of scientific training is to produce highly competent problem-solvers who will work within the paradigm. The important point here is that the prevailing paradigm determines not only what questions are asked but also what kinds of data are considered relevant and how the data will be gathered, analyzed, interpreted, and related to theoretical concepts (Chaves, 1968; Spanos & Chaves, 1970).
TENACITY OF PARADIGMS AND RESISTANCE TO NEW DISCOVERIES
Although a new paradigm may very slowly and imperceptibly supplant a prevailing paradigm (Toulmin, 1970; Watkins, 1970), the history of science shows that scientists often hold on tenaciously to an accepted paradigm and vigorously fight off any challenges. In fact, Planck (1936, p. 97) argued that new paradigms and theories are rarely accepted by rational persuasion of their opponents; instead, the new paradigm is accepted only after the opponents die out. Kuhn (1962, 1970) has presented a series of examples demonstrating the tenacity of paradigms and these have been supplemented in recent years by writers such as B. Barber (1961), de Grazia (1966), and Koestler (1971). Let us glance at a few of the examples presented by the latter three authors.
B. Barber (1961) noted that, although science comprises a social system in which objectivity and openness to new ideas are usually greater than in other social institutions, nevertheless, discoveries or ideas that challenge the dominant paradigm are not readily accepted. As examples, he presented the following: resistance to Copernicus' heliocentric theory from astronomers who could not break with the traditional Ptolemaic paradigm which viewed the earth as motionless; resistance to Thomas Young's wave theory of light by the scientists of the 19th century who were faithful to the corpuscular paradigm; resistance to Mendel's conception of the separate inheritance of unit characteristics by biologists who adhered to the prevailing paradigm which postulated joint and total inheritance of biological characteristics; and resistance to Ampere's theory of magnetic currents by scientists who could not fit it into the prevailing Newtonian mechanical model.
A recent example of the strength of accepted paradigms and the furious reactions that they may arouse when challenged is the case of Immanuel Velikovsky. Velikovsky challenged the prevailing astronomical paradigm which assumed that only those processes which are operating today in our solar system could have operated in earlier periods of man's recorded history. The dominant "uniformitarian" paradigm, which excluded sudden global catastrophes, was frontally attacked by Velikovsky who amassed historical, geological, paleontological, and archeological evidence indicating that, during historical times, the earth had been subject to catastrophes from Venus and Mars. This new paradigm aroused furious reactions from scientists who adhered to the traditional paradigm; these reactions included attempts to stop publication of Velikovsky's book (Worlds in Collision) and to exclude his writing from learned journals (de Grazia, 1966; Stove, 1972). Although Velikovsky's ideas are as debatable as any radically new ideas, it appears that the furious reactions were due, at least partly, to a paradigm clash (de Grazia, Juergens, & Stecchini, 1966). It is noteworthy that some of the predictions made by Velikovsky from his paradigm have been confirmed (Anonymous, 1972a; Bargmann & Motz, 1962; Stove, 1972).
Many other examples of the effects of paradigms in preventing acceptance of research findings are presented by Kuhn (1962, 1970), Koestler (1971), and others. For instance, Koestler (1971) has shown that important experiments carried out during the 1920s, which contradicted the prevailing Darwinian paradigm by supporting Lamarck's thesis of the inheritance of acquired characteristics, were viciously attacked by the scientific establishment and that no one ever tried to replicate the experiments before condemning them.
FAILING TO "SEE" EVENTS AND "SEEING" NON-EXISTENT EVENTS
Scientists at times fail to "see" events that are incongruent with the assumptions of a prevailing paradigm. An interesting example is the failure of physicists to "see" indications of the positron even though the signs were present for many years. The positron is exactly like the electron, except that it has a positive electric charge. Apparently, physicists failed to recognize the positron because they assumed that electrons were always negatively charged and that positive charges were always carried by the much heavier protons (Hanson, 1963).
Similar considerations apply to the discovery that noble gases, such as neon and argon, can combine chemically. This finding was not made until 1962 even though any investigator could have made the observation easily, with a few hours of effort, anytime during the 1940s or 1950s. It appears that the discovery was not made earlier because of the strength of the traditional assumption that noble gases simply cannot combine (Abelson, 1962).
Scientists may not only miss phenomena that do not fit their assumptions, they may also "see" non-existent phenomena, such as N-rays, when the phenomena fit the prevailing assumptions. During the early years of this century, after the discovery and acceptance of X-rays, many scientists were deluded into also "seeing" N-rays (Rostand, 1960). Nearly 100 papers on N-rays were printed in a single year (1904) in the official French scientific journal Comptes Rendus (de Solla Price, 1961). However, in a letter to Nature in 1904, Robert W. Wood, professor of physics at Johns Hopkins University, showed that all of the effects attributed to N-rays were due to wishful thinking and to the immense difficulties involved in estimating by eye the brightness of faint objects. From then on, there were no more papers on N-rays.
PARADIGMS IN PSYCHOLOGY
Hearst (1967), Krantz (1971), and others (e.g., Harlow, 1969) have noted that present-day behaviorists who adhere to the Skinnerian or operant conditioning approach appear to share a common paradigm. The Skinnerians reject competing approaches to psychology, and do not cite or utilize the work of non-operant psychologists. Hearst (1967) noted that:
To many outsiders an operant conditioner is a hardnosed experimentalist who spends endless hours in the enthusiastic analysis of cumulative records from one or two subjects, attacks anything that sounds even mildly theoretical or physiological, ridicules anyone who has ever used statistics of the R.A. Fisher variety and ignores the work of any psychologist who does not publish in the Journal of the Experimental Analysis of Behavior. Not since J.B. Watson's time has any band of behaviorists seemed so assertive in its likes and dislikes and so convinced that its techniques and experimental approach will not only change psychology but in the process reshape the world. (p. 402)
In recent years, however, a "cognitive" paradigm has begun to compete with the behavioristic paradigm for the allegiance of experimental psychologists. As Katahn and Koplin (1968) pointed out, the behavioristic paradigm emphasizes objective descriptions of environmental events, operational definitions, and controlled experiments while the cognitive paradigm emphasizes internal information processing and programming. The investigator who adheres to the behavioristic paradigm seeks antecedent environmental and situational events that can be related to denotable behaviors. On the other hand, the investigator who adheres to the cognitive paradigm seeks to construct a model of internal processes and structures that can lead to the observed output. These contrasting paradigms lead to different questions and to different ways of designing and conducting investigations. Furthermore, even if psychologists who adhere to these divergent paradigms obtain similar data (which is highly unlikely, since they will conduct quite different studies), their paradigms will lead to divergent interpretations of the data (Katahn & Koplin, 1968). Similarly, investigators who adhere to a third paradigm that is found in present-day psychology, the Freudian paradigm, will ask another set of questions (for example, questions pertaining to unconscious processes), will gather data in a different way (for example, by inferring unconscious processes from the words and actions of clinical patients), and will relate the data to a different frame of reference (the theoretical concepts that are derived from Freud).
Each of the prevailing paradigms in psychology determines what "facts" are to be gathered and how they are to be interpreted. Kessel (1969) noted, with regard to the behavioristic paradigm, that "the behaviorist's presuppositions have led to a choice of phenomena and methods that render his position basically irrefutable: It is hardly likely that the human being will reveal 'higher-order' [mental] activities when his eye blink or knee jerk are being conditioned, or when he is learning to associate pairs of nonsense syllables" (p. 1003). To document this argument, Kessel noted, for example, how Spence, in the same way as other behaviorists, treated higher mental processes as "confounding variables" and thus could maintain his behavioristic presuppositions by controlling the "confounding variables."
As was stated above, research results which are in harmony with a prevailing paradigm are generally viewed as acceptable whereas those which are inharmonious are generally viewed as not acceptable. This was illustrated in a recent study by Goodstein and Brazis (1970). These investigators mailed to a random sample of psychologists virtually identical abstracts of presumably empirical research on astrology. The abstracts differed in only one respect: half reported positive findings and half negative findings. Even though the purported design of the study was identical for the two sets of abstracts, psychologists receiving the abstract reporting negative findings about astrology rated the study as better designed, more valid, and as having more adequate conclusions than those receiving the abstract reporting positive findings.
There is evidence indicating that the investigator's viewpoint and his degree of orthodoxy, that is, his acceptance of a dominant paradigm, influence the editor's or the referee's decision to accept or reject his article for publication in a scientific journal (Crane, 1967; Mahoney, 1975). However, very few studies have been conducted pertaining to this kind of paradigm bias in editorial decisions and this is an important area for further research.
Although much has been said during recent years about how experimenters bias their results, comparatively little has been said about how investigators bias their results. Investigators bias their results, in accordance with their paradigm and correlated theories, at practically all stages of the research process. At the very beginning of the research there is bias in the questions that are asked and the hypotheses that are formulated. Each aspect of the research (e.g., the experimental design, the choice of subjects, the selection and training of experimenters, the analysis of data) is also biased by the underlying paradigm. Finally, the interpretations and conclusions that are drawn from the data are closely related to the underlying paradigms and associated theories (Dunnette, 1966).
PARADIGMS VERSUS PET THEORIES OR HYPOTHESES
As stated above, paradigms refer to general beliefs and methods that are shared by many scientists at a given time, for example, the behavioristic and the Freudian paradigms in psychology. Also, as stated above, investigators who adhere to a particular paradigm tend to bias their studies in line with the paradigm in many ways: in the kinds of questions that they ask, in the methods they use to answer the questions, and in the way they interpret their data. A rather different kind of bias is also present within any one paradigm. Within a paradigm, investigators inevitably differ from each other by favoring different theories or hypotheses. Although the biasing effects of the general paradigm are difficult to see and difficult to take into account, the biasing effects of pet theories or hypotheses within any one paradigm are more easily seen. Relevant data were provided by Mitroff (1974) who interviewed 42 lunar scientists, asking questions such as, "Do scientists have to be committed to their ideas?" and, "Is commitment a threat to objectivity?" He reported the following:
Of the 42 scientists interviewed, every one indicated that he thought the notion of the purely objective, uncommitted scientist was naive ... To the credit of these scientists, they not only freely acknowledged their biases but also argued that in order to be a good scientist, one had to have biases. The best scientist, they said, not only has points of view but also defends them with gusto. Their concept of a scientist did not imply that he would cheat by making up experimental data or falsifying it; rather he does everything in his power to defend his pet hypotheses against early and perhaps unwarranted death caused by the introduction of fluke data. The objectivity of science is a result, the scientists said, not of each individual scientist's unbiased outlook, but of the scientific community's examination and debate over the merits of respective biases. (p. 65)
An important proviso here, however, is that although the scientific community is ready to debate the merits of respective biases within an accepted paradigm, the biases that are inherent in the paradigm itself are much more difficult to see and to debate. For instance, the materialistic paradigm that underlies all science has practically never been criticized by scientists even though the basic tenets of materialism have been seriously questioned by philosophers at least since the days of Plato (Taylor, 1972).
RECOMMENDATIONS AND CONCLUSIONS
Of course, investigators cannot carry out research without having some basic assumptions and a way of conceptualizing the area of inquiry. Although a paradigm and associated theories are necessary for the conduct of research, investigators can become more aware of their underlying paradigm and can try to make their assumptions more explicit (Barber, 1970b; Chaves, 1968; Spanos, 1970; Spanos & Chaves, 1970).
The training of scientists should include more focused concern on the history of the sciences. The emphasis should be placed on how the accepted notions of physical, biological, and behavioral sciences have varied over time and how "facts" and "knowledge" were always relative to the preconceptions, assumptions, and paradigms that existed at a given time (Brush, 1974).
As Ziman (1968) pointed out, "the major task, and the corresponding problem of scientific education is easily defined; it must teach the consensus without turning it into an orthodoxy. The student must become perfectly familiar and at ease with the current state of knowledge and yet ready to overthrow it, from within" (p. 69). The problem, as stated by Sjoberg and Nett (1968), that must be squarely faced by the teachers of psychologists and other scientists, is "to balance off the need for sound, tested knowledge against the need for new and 'deviant' ideas in science" (p. 339).
Certainly, graduate training of psychologists and other scientists should include greater emphasis on the bias that is associated with paradigms and theories. As Dunnette (1966), McGuire (1973), and Sherif (1970) have pointed out, too many psychologists try to prove rather than modify their theories, and when investigators are committed to a theory there tends to be an "unconscious" focusing on data which support the theory and a relative neglect of data that are not in harmony with the theory.
Dunnette (1966) has also recommended that some of the problems associated with the Investigator Paradigm Effect can be mitigated if psychologists are taught to test multiple alternative hypotheses thoroughly rather than only their one preferred hypothesis. Dunnette described this approach, which was originally formulated by Platt (1964), as follows:
The approach entails devising multiple hypotheses to explain observed phenomena, devising crucial experiments each of which may exclude or disprove one or more of the hypotheses, and continuing with the retained hypotheses to refine the possibilities that remain ... One might say that the research emphasis is one of 'studying hypotheses' as opposed to 'substantiating theories'. (p. 350)
NOTES
1. Another general problem in the publication of research findings is due to an institutionalized norm: "scientists are expected to focus their reports on the logical structure of the methods used and ... are praised for presenting their research in a way that is elegantly bare of anything that does not serve this primary function and are deterred from reporting 'irrelevant' social and psychological aspects of the research process" (B. Barber & Fox, 1958, p. 525). Scientific papers thus do not report potentially important components of the experimental research and thus tend to distort what actually goes on during the research process.
2. When experiments are reviewed in articles or textbooks, a process of "leveling, sharpening, and assimilation" commonly occurs which produces a simplified interpretation tending to be in harmony with the reviewer's assumptions and preconceptions. Interesting examples of how complex research findings are simplified in a biased way in textbooks and review articles are presented by Berkowitz (1971) and Yarrow, Campbell, and Burton (1968, pp. 132-133).
Investigators who adhere to the same paradigm and who hold similar theories may nevertheless obtain dissimilar results and draw divergent conclusions because they design their experiments differently or carry out different kinds of studies. Denzin (1970) noted that the way the study is conducted will, in part, determine the results:
Suppose that the same empirical situation is selected-for example, a mental hospital. The first investigator adopts the survey as his method; the second, participant observation. Each will make different kinds of observations, engage in different analyses, ask different questions, and-as a result-may reach different conclusions. (Of course the fact that they adopted different methods is not the only reason they will reach different conclusions. Their personalities, their values, and their choices of different theories will also contribute to this result.) (p. 12)
To illustrate the contention that the results of an experimental study can depend on the experimental design, I shall briefly discuss in turn (a) the complexity of the design, (b) whether the design takes account of sex differences, and (c) whether the experiment utilizes a same-subjects design or a randomized groups design.
Simple experimental designs are likely to yield simple results whereas more complex designs, such as factorial designs, are more likely to yield complex results that can negate the conclusions from studies using simple designs. To illustrate this contention let us look briefly at hypnosis research. During the late 1920s and early 1930s, Clark Hull was the leading investigator in this area. He typically used a very simple experimental design in which the subjects' performance was assessed under an awake condition and after they had been exposed to a hypnotic induction procedure. Hull (1933) concluded from his experiments that responsiveness to suggestions was markedly higher under hypnotic trance as compared to the waking condition. Later experiments were conducted by others which included a third experimental condition; the subjects were tested either under a hypnotic condition, an awake condition, or an additional condition in which they were urged to try to perform to the best of their ability and to try to imagine vividly those things that were suggested (task-motivational condition). In general, the results of these studies indicated that under the task-motivational condition the subjects were as responsive to suggestions as under the hypnotic induction condition (Barber, 1969a). Further experiments were then conducted which used factorial designs that simultaneously assessed the independent effects and the interactions of several variables. These experiments with more complex designs obtained more complex results. How subjects performed in hypnotic experiments was found to be affected by numerous variables such as how the situation was defined to the subjects, the wording and tone of the suggestions or instructions, subjects' attitudes toward "hypnosis", and subjects' expectancies concerning their own performances (Barber, 1969a, 1971a; Barber, Spanos, & Chaves, 1974).
As the experiments became more complex by including more variables in factorial designs, the data and conclusions changed and "hypnotic trance" could no longer be viewed as simply as in the days when simple experimental designs were employed.
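The difference between what a simple design and a factorial design can reveal may be sketched with a small computation. The cell means below are invented for illustration and are not taken from any of the studies cited above; the factor names (induction procedure, subjects' attitude) are likewise only assumptions chosen to echo the hypnosis example.

```python
# Hypothetical cell means (responsiveness scores) for a 2 x 2 factorial design.
# The factor names and every number here are invented for illustration only.
cell_means = {
    ("hypnotic_induction", "positive_attitude"): 8.0,
    ("hypnotic_induction", "negative_attitude"): 3.0,
    ("no_induction", "positive_attitude"): 5.0,
    ("no_induction", "negative_attitude"): 3.0,
}


def marginal_mean(factor_index, level):
    """Average the cell means for one level of a factor over the other factor."""
    values = [m for key, m in cell_means.items() if key[factor_index] == level]
    return sum(values) / len(values)


# What a simple two-condition design would report: one overall induction effect.
induction_effect = (
    marginal_mean(0, "hypnotic_induction") - marginal_mean(0, "no_induction")
)

# What the factorial design adds: the induction effect at each attitude level.
effect_positive = (
    cell_means[("hypnotic_induction", "positive_attitude")]
    - cell_means[("no_induction", "positive_attitude")]
)
effect_negative = (
    cell_means[("hypnotic_induction", "negative_attitude")]
    - cell_means[("no_induction", "negative_attitude")]
)

# The interaction contrast: how much the induction effect depends on attitude.
interaction = effect_positive - effect_negative

print("overall induction effect:", induction_effect)
print("effect with positive attitude:", effect_positive)
print("effect with negative attitude:", effect_negative)
print("interaction contrast:", interaction)
```

With these invented numbers the overall induction effect is 1.5 points, yet the effect is 3.0 points among subjects with a positive attitude and 0.0 among those with a negative attitude. A two-condition design would report only the first figure.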
Experimental designs which consider the results for male and female subjects separately commonly yield results which differ from those which lump the results for males and females together. The effects of sex differences on experimental results were demonstrated by Carlson (1971) when she reviewed the papers published during 1968 in the Journal of Personality and the Journal of Personality and Social Psychology. Carlson noted that "Among the studies that could have tested for sex differences less than half reported such tests. Yet in 51 studies where sex differences were examined, significant effects of sex were found in 74 percent of the studies" (p. 205).
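How lumping the sexes together can hide an effect may be illustrated with a minimal sketch. The scores below are invented and do not come from Carlson's review; they are constructed so that the treatment works in opposite directions for the two sexes.

```python
from statistics import mean

# Hypothetical records: (sex, condition, score). All values are invented.
records = [
    ("female", "treatment", 14), ("female", "treatment", 16),
    ("female", "control", 10), ("female", "control", 10),
    ("male", "treatment", 8), ("male", "treatment", 6),
    ("male", "control", 12), ("male", "control", 12),
]


def treatment_effect(rows):
    """Mean treatment score minus mean control score for the given rows."""
    treated = [score for _, cond, score in rows if cond == "treatment"]
    control = [score for _, cond, score in rows if cond == "control"]
    return mean(treated) - mean(control)


# Analysis that lumps the sexes together.
pooled_effect = treatment_effect(records)

# Analysis that considers each sex separately.
effect_by_sex = {
    sex: treatment_effect([r for r in records if r[0] == sex])
    for sex in ("female", "male")
}

print("pooled effect:", pooled_effect)
print("effect by sex:", effect_by_sex)
```

In this constructed case the pooled analysis shows no treatment effect at all, while the separate analyses show a gain of 5 points for females and a loss of 5 points for males.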
Whether the investigator uses a same-subjects design or a randomized groups design can also affect his results and conclusions. An experimental design in which one group of subjects is exposed to all of the experimental conditions (same-subjects design) is clearly preferable for some problems, such as the effects of increasing amounts of practice on learning. However, in most experimental studies either the same-subjects design can be used or different subjects can be randomly assigned to each treatment (randomized groups design). It is commonly assumed that the results obtained with the same-subjects design are the same as would be obtained with the randomized groups design. This is not always so. For instance, Grice and Hunter (1964) showed, in experiments dealing with eyelid conditioning and with simple reaction time, that quite different results were obtained with the same-subjects design and the randomized groups design.
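One way in which the two designs can diverge may be sketched with a toy model. The model is an assumption made for illustration, not the analysis or the data of Grice and Hunter (1964): it supposes that a subject's response depends partly on the stimulus itself and partly on how that stimulus compares with the other stimuli the same subject has experienced. The parameter names and values are hypothetical.

```python
# Toy model, invented for illustration: the response to a stimulus depends on
# its intensity and on its contrast with the subject's own average experience.
SLOPE = 2.0      # assumed direct effect of one unit of stimulus intensity
CONTRAST = 1.5   # assumed extra effect per unit of deviation from the
                 # average intensity that the subject has experienced


def response(intensity, experienced):
    """Response of a subject who has experienced the listed intensities."""
    context = sum(experienced) / len(experienced)
    return SLOPE * intensity + CONTRAST * (intensity - context)


weak, strong = 1.0, 3.0

# Randomized groups design: each subject experiences only one intensity,
# so there is nothing to contrast it with.
between_effect = response(strong, [strong]) - response(weak, [weak])

# Same-subjects design: every subject experiences both intensities.
within_effect = (
    response(strong, [weak, strong]) - response(weak, [weak, strong])
)

print("effect in randomized groups design:", between_effect)
print("effect in same-subjects design:", within_effect)
```

Under these assumptions the same pair of stimuli yields a difference of 4.0 in the randomized groups design and 7.0 in the same-subjects design, simply because the second design lets each subject compare the stimuli.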
Pereboom (1971) has given additional examples of how the experimental design, and also the scales used for measurement, can affect the results. For example, given certain kinds of variability in the data, the choice of a scale for a given response measure may reverse one's conclusions (Edgington, 1960). Similarly, the type of interactions obtained may depend upon the type of measurement scales that are used (Hays, 1963).
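The point that the choice of scale can reverse a conclusion may be made concrete with a small sketch. The two groups of scores are invented, and the logarithmic scale is only one example of a rescaling an investigator might choose; neither is taken from Edgington (1960).

```python
import math
from statistics import mean

# Hypothetical scores (for example, response times in seconds); all invented.
# Group A is highly variable; group B is not variable at all.
group_a = [1.0, 1.0, 1.0, 100.0]
group_b = [20.0, 20.0, 20.0, 20.0]

# Comparison on the raw scale.
raw_difference = mean(group_a) - mean(group_b)

# The same comparison after a logarithmic rescaling of every score.
log_difference = (
    mean([math.log(x) for x in group_a]) - mean([math.log(x) for x in group_b])
)

print("A minus B on the raw scale:", raw_difference)
print("A minus B on the log scale:", log_difference)
```

On the raw scale group A has the higher mean; on the logarithmic scale group B does. The reversal arises from the single extreme score in group A, which dominates the raw mean but counts for much less after rescaling.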
In brief, the way an investigator designs his experiment can affect the results he obtains. Investigators who design studies and those who utilize or review the studies of others should place greater emphasis on the fact that experimental results are dependent on the way the experiment was designed.
A third effect associated with the investigator, the Investigator Loose Procedure Effect, pertains to the degree of imprecision of the experimental script or protocol which gives the step-by-step details of the procedures to be used in the experiment. In rather rare instances, experiments do not have a formal protocol or standardized procedures. In these cases the investigator has a general idea how the experiment is to proceed, but the steps of the procedure are not planned or written out beforehand and the way the subjects are to be treated is not standardized.
An example of experiments that do not have a formal protocol can be taken from the area of hypnosis research (Barber, 1969a, 1970b). Prior to the advent of rigorous research in this area, the experimenters were instructed (by the investigators) that they were to hypnotize one group of subjects but not another group. Nothing was stated as to what was to be said to the subjects, how the hypnotizing was to be done, or how long the hypnotic procedures were to last. Of course, it is difficult to draw conclusions from experiments based on such loose protocols because the procedures can vary with the moment-to-moment predilections of the experimenter. A study based on such imprecise procedures is unscientific in that science is based on the premise that the procedures of an experiment are specified in sufficient detail so that they can be replicated in other laboratories. If the procedures are imprecise, other laboratories cannot proceed to replicate them and to crossvalidate the results.
In a somewhat more common case than the one described above, the experimental protocol has more precise specifications as to how the experiment is to be conducted, but there is still much missing and there is room for the experimenter to vary the procedure from subject to subject. For instance, experiments in psychology are, at times, based on experimental protocols which state that certain kinds of questions are to be answered by the subjects but the protocols do not state what is to be done if the subjects do not understand or misunderstand the questions. This failure to plan for contingencies is also found in loose protocols that do not state what the experimenter is to do at various steps in the procedure - for instance, how he is to interact with the subject immediately before he begins the experiment, what the experimenter is to do if the subject interrupts the experimental procedure because he wishes to smoke a cigarette, or how the experimenter is to carry out a specific test or interview procedure. The data from experiments that are based on loose experimental procedures are often reported very precisely. However, since the precise data are based on loose procedures that leave much room for bias, they can be misleading.
Loose experimental procedures can give rise to unreliable results. For instance, Feldman, Hyman, and Hart (1951) showed that experimenters obtain dissimilar data when there is a loose procedure (when they are permitted latitude in the way they word the questions that are submitted to the subjects). However, the same investigators also showed that experimenters obtain very similar data when the procedure is well structured (when the wording of the questions is clearly specified beforehand).
Raffetto (1967) has recently reported a study which illustrates the Investigator Loose Procedure Effect. The study was concerned with the effects of the experimenters' expectancies on reports of sensory experiences and hallucinations elicited in a sensory deprivation situation. Some of the experimenters were led to expect (by the investigator) that sensory deprivation produces many reports of sensory experiences and hallucinations, while other experimenters were led to expect that sensory deprivation produces few such reports. After each subject had undergone a period of sensory deprivation, he was interviewed by an experimenter. During these experimental interviews, experimenters expecting many reports of sensory experiences and hallucinations elicited more reports of this kind than experimenters expecting few such reports.
Raffetto's data indicated that the experimenters influenced their subjects' reports because the experimental procedures were very loose - the interviews were not standardized and the experimenters conducted their interviews in different ways. As compared to experimenters expecting few reports of sensory experience and hallucinations, experimenters expecting many such reports more often encouraged their subjects to continue talking about their experiences, were much more active interviewers, and held much longer interviews. Thus it appears that when the investigator constructs a loose experiment and allows the experimenter to vary how he conducts the study, experimenters may conduct the study differently with subjects from whom they expect different responses, and this variability can affect the results.
In brief, if the investigator constructs a loose protocol and allows the experimenters to vary how they conduct the experimental procedures or interviews with different subjects, it is likely that the results of the experiment will be misleading. Investigators should make greater efforts to tighten their procedures and to avoid the Investigator Loose Procedure Effect.
The investigator's responsibility extends beyond deciding the kind of study to undertake, the kinds of data to gather, the type of experimental design to use, and the specific instructions and procedures to employ in the experiment. The investigator also has control of and is responsible for the data analysis even though the actual computations may be performed by an assistant or a computer. This phase of the research can easily give rise to an Investigator Data Analysis Effect.
Careful checks of statistical procedures by knowledgeable reviewers (e.g., Chapanis, 1963, pp. 310-313) at times reveal serious mistakes in the statistical analyses used in research reports which invalidate the conclusions. A survey by Wolins (1962) similarly indicated that inappropriate data analysis may not be uncommon and raised questions about investigators' willingness to permit reanalysis of their original data. Wolins asked 37 psychologists, who had recently published journal articles, for their original data. Twenty-six of the 37 (70 percent) did not reply or claimed that their original data were either lost, misplaced, or inadvertently destroyed. Finally, Wolins was able to reanalyze seven sets of data supplied by five investigators. Of the seven analyses, three involved gross errors. These errors were sufficiently great to change the conclusions reported in the journal articles. For instance, in one analysis several F ratios near one (which were clearly nonsignificant) were reported to be highly significant, and another F ratio was incorrectly reported to be nonsignificant due to the use of an inappropriate error term.
A decade later, Craig and Reese (1973) found an improvement in investigators' willingness to show their original data but they did not ascertain whether errors in statistical analysis had decreased. They wrote to 53 authors of articles published during one month in four psychological journals and asked for copies of their original data to be used in a master's thesis. Of the 53 authors who received letters, 8 did not reply, 9 completely refused to share their data, 5 reported that the data were currently unavailable, and 4 indicated that the data had been lost or destroyed. However, the results were more encouraging than those reported a decade earlier by Wolins (1962); Craig and Reese (1973) reported that about half of the authors (27 of 53) who received letters requesting their data cooperated to some degree: 20 sent their data or a summary analysis and 7 offered data if they were provided with further information.
Let us now look at nine types of data analyses that can produce biased results. Although these nine pitfalls in analyzing data tend to be overlapping, they can be best clarified by discussing them separately.
1. A serious potential pitfall is present when investigators collect a large amount of data and have not pre-planned how they are to analyze the data. Lipset, Trow, and Coleman (1970) have emphasized this pitfall, noting that "If [an investigator] is blessed with an abundance of data ... he can select those data which confirm his hypothesis that a relationship exists" [p. 83]. The major problem here is that the investigator decides how the data are to be analyzed after he has "eyeballed" or studied the data. After the investigator has perused the data, he may decide to analyze only certain parts of the data while neglecting other parts. When the investigator has not planned the data analysis beforehand, he may find it difficult to avoid the pitfall of focusing only on the data which look promising (or which meet his expectations or desires) while neglecting data which do not seem "right" (which are incongruous with his assumptions, desires, or expectations). When not planned beforehand, data analysis can approximate a projective technique, such as the Rorschach, because the investigator can project on the data his own expectancies, desires, or biases and can pull out of the data almost any "findings" he may desire.
2. Investigators at times fail to report that the data did not support their original hypothesis. Instead, after they have studied the data, they derive a new hypothesis that is supported by the data and then "verify" the new hypothesis by performing a statistical test on the same data from which it was derived (Lipset, Trow, & Coleman, 1970; Selvin, 1970). Although investigators may derive a new hypothesis from a completed study, the new hypothesis needs to be tested and verified in a subsequent study.
3. Investigators at times collect incidental data that are not directly related to the hypotheses they are testing. If they fail to confirm their original hypotheses, they then perform a large number of statistical tests on the remaining data and report whatever significant results are obtained as "findings." The rationale behind this procedure seems to be "If we don't get significant results on the variables we are interested in, then we'll have these other variables to fall back on and we'll have something 'positive' to report." These kinds of procedures can easily lead to misleading conclusions. In the next section of this chapter we shall look carefully at a report that illustrates this kind of pitfall in data analysis.
4. At times investigators conduct postmortem analyses on the same data after the originally-intended analyses have been performed and have failed to yield significant findings. This misleading procedure typically involves cutting or slicing the data in originally unintended ways. This kind of postmortem analysis can provide hypotheses to be tested in further research but it leads to misleading conclusions when the results are accepted without replication. The reason why postmortem data analyses lead to misleading conclusions is that random numbers will yield statistically significant "findings" if they are cut or subdivided in various ways and subjected to statistical analysis. This kind of Investigator Data Analysis Effect is not uncommon in psychological research (Clement, 1972) and we shall present an example of this pitfall in the next section of this chapter.
5. Investigators at times perform a large number of statistical tests and find that a small number, say 5 percent, are significant at the .05 level. The "significant" results are then reported without consideration of the fact that about 5 percent of the comparisons can be expected to be significant at the .05 level by chance alone.
6. Investigators at times "report from among a sizable number of computed comparisons only those that are significant [but the] reader is not told about this selection" (McNemar, 1960). Of course, when many statistical tests are performed on a set of data, the alpha rate changes considerably (Blanchard, 1971). Feild and Armenakis (1974) have clearly demonstrated how multiple tests of significance can easily lead to erroneous conclusions. For instance, they state: "Suppose an investigator set his significance level at .05 and conducted 10 independent tests. He may think that his probability of Type I error [rejecting a null hypothesis when it is true] is .05. However, his actual probability of Type I error in one or more of the 10 decisions is .40" (p. 428). Neher (1967) has labeled this pitfall as probability pyramiding and has commented as follows:
Reporting the 5 percent level for a finding means that there is only a 5 percent chance that it is a spurious finding resulting solely from chance variations. If, however, two independent analyses are done, the probability that at least one such analysis will yield a spurious, significant finding at this level is greater than 5 percent. (The assumption of independence of the two analyses, while not always true, simplifies the discussion without introducing serious error.) To determine the new probability level, one may calculate the probability that a significant result would not be obtained in either of the two tries (.95 × .95) and then subtract this from 1. Thus, $1 - (.95)^2 = 1 - .902 = .098$. If three independent analyses are done, the real level becomes $1 - (.95)^3 = 1 - .857 = .143$. (Each individual analysis increases the probability pyramiding, even though it may be part of one large 'analysis', such as stepwise multiple regression, item analysis, etc.). (p. 259)
7. Related problems arise when an investigator obtains negative results (or fails to confirm his hypothesis) and then fails to report his negative results. In an interesting paper, Dunnette (1966) described his personal experiences which led him to "become aware of the massive size of this graveyard for dead studies . . ." (p. 347). Similarly, McNemar (1960) noted that investigators at times "simply discard all data of an experiment as bad data if not in agreement with theory, and start over." The problem here is that if the investigator obtains positive results in a later study and publishes them without mentioning his earlier negative results, the reader is likely to conclude wrongly that the positive results are more stable, more easily replicable, or more valid than is actually the case. Rhine (1974b) has appropriately pointed out that, "The subtle private judgments about what data to 'declare' in reporting constitute an area that needs the fullest possible safeguarding" (p. 110).
8. When an investigator obtains negative results that fail to confirm his hypothesis he is likely to check for computational errors in the data analysis or to run another data analysis (Friedlander, 1964). However, when the original analysis confirms the investigator's hypothesis, it is unlikely that he will check for computational errors or run another analysis. To illustrate these pitfalls, Friedlander (1964) courageously offered himself as an example, describing how he looked for mistakes in the data analysis when it yielded results that contradicted his expectations. He concluded that research investigators tend to accept the adage, "If you don't succeed at first, try and try again", and they also accept the adage, "If you do succeed at first, do not try again" (p. 199).
9. At times, investigators place heavy emphasis upon a statistically significant outcome but fail to point out that the degree or strength of association between the two variables is actually very small or negligible (Kish, 1970). A significant value of F, t, or chi-square means that probably there is some dependence between the variables in the population, but the degree of dependence may be practically zero regardless of the significance level (Duggan & Dean, 1968). Kish (1970) appropriately pointed out that, "The results of statistical 'tests of significance' are functions not only of the magnitude of the relationships studied but also of the number of sampling units used (and the efficiency of design). In small samples, significant, that is, meaningful results may fail to appear 'statistically significant.' But if the sample is large enough the most insignificant relationships will appear 'statistically significant' " (pp. 138-139). There are several ways to avoid this pitfall. Instead of presenting results in terms of tests of significance, they could be presented in terms of confidence intervals (Natrella, 1960). Probably a better way of avoiding the pitfall is to present an estimate of the strength of association along with the statistical test of significance (Dunnette, 1966, p. 350). Such measures of degree or strength of association include, for example, Goodman and Kruskal's gamma for chi-square, $r^2$ for Pearsonian correlations, omega squared, and many others that are discussed by Cohen (1965), Fleiss (1969), and Keppel (1973, Chap. 25).
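Two of the pitfalls listed above reduce to simple arithmetic, and a short numerical sketch may make them concrete. The Python fragment below is an illustration only and is not taken from any of the cited analyses. The function `familywise_error` computes the probability-pyramiding quantity 1 - (1 - alpha)^k described in the passage quoted from Neher (1967), assuming k independent tests. The function `t_from_r` uses the standard t statistic for a Pearson correlation to show how a negligible association can become "statistically significant" once the sample is large. The function names, the correlation of .05, and the sample sizes of 30 and 10,000 are hypothetical choices made for the example.

```python
import math


def familywise_error(alpha, k):
    """Chance of at least one spurious 'significant' result in k independent tests."""
    return 1.0 - (1.0 - alpha) ** k


def t_from_r(r, n):
    """t statistic (df = n - 2) for testing a Pearson correlation r against zero."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))


# Probability pyramiding: the real error rate grows with the number of tests.
for k in (1, 2, 3, 10):
    rate = familywise_error(0.05, k)
    print(f"{k:2d} tests at .05 -> P(at least one Type I error) = {rate:.3f}")

# Significance versus strength: same weak correlation, two sample sizes.
r = 0.05  # hypothetical, very weak association
for n in (30, 10_000):
    print(f"n = {n:6d}: r^2 = {r * r:.4f}, t = {t_from_r(r, n):.2f}")
```

With alpha = .05 the first function gives roughly .098, .143, and .40 for 2, 3, and 10 tests, in line with the figures quoted above. In the second loop the strength of association (r squared = .0025) is identical in both rows; only the sample size, and therefore t, changes.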
PITFALLS IN DATA ANALYSIS: TWO ILLUSTRATIVE STUDIES
To illustrate the pitfalls that were listed above, we shall analyze two important and influential studies that are directly pertinent to the topic of this book. The two illustrative studies aimed to demonstrate one of the major pitfalls discussed in this text (Pitfall X), namely, that experimenters unintentionally and subtly communicate their expectancies to their subjects and the subjects respond in accordance with the experimenters' expectancies (Experimenter Unintentional Expectancy Effect). Paradoxically, while trying to demonstrate one of the major pitfalls, the Experimenter Unintentional Expectancy Effect, these studies seem to demonstrate another one of the major pitfalls, the Investigator Data Analysis Effect.
Illustrative Study 1
In the first study (Rosenthal, Persinger, Mulry, Vikan-Kline, & Grothe, 1964b), 20 student experimenters were asked to test a total of 73 subjects on Rosenthal's person-perception task. When using this task, the subject is shown a series of photographed faces. The subject is asked to rate on a numerical scale whether each of the individuals shown on the photographs has been experiencing success (high ratings) or has been experiencing failure (low ratings).
The study was designed to show that experimenters obtain ratings from their subjects that they expect to obtain. To induce the student experimenters to expect high (or low) ratings from their subjects on the person-perception task, each experimenter was told (by an investigator) that, on the basis of personality tests given to the subjects, it could be expected that certain of their subjects would perceive the photographed individuals as successful (high ratings) and other specified subjects would see them as failures (low ratings). (Since the subjects were not given the personality tests and were randomly assigned to the experimenters, the subjects should not actually differ in their ratings.) The dependent variable was the difference between the average ratings obtained by each experimenter from those subjects whom he expected would give high ratings and those subjects whom he expected would give low ratings.
The investigators (Rosenthal et al., 1964b) did not perform an overall statistical analysis of the data to determine if the subjects' ratings were harmonious with the experimenters' induced expectancies for high or low ratings. Instead of determining first whether the data showed the hypothesized Experimenter Unintentional Expectancy Effect, the investigators stated first that 3 of the 20 experimenters showed a "reversal of the biasing effect of expectancy, i.e., they obtained data significantly opposite to what they had been led to expect." The investigators then analyzed the data for the remaining 17 experimenters and reported that these experimenters showed a significant Experimenter Unintentional Expectancy Effect, that is, they obtained ratings from their subjects in line with their (the experimenters') expectancies.
There are several interrelated reasons why this conclusion - that the study showed an Experimenter Unintentional Expectancy Effect - cannot be accepted as valid: (a) The investigators concluded that the effect was present after performing an analysis that did not include the negative data (in the opposite direction) that were obtained by 3 of the 20 experimenters. (b) The negative data were excluded from the analysis (which supposedly showed the Experimenter Unintentional Expectancy Effect) after the investigators had inspected the data and after they had determined that some of the data were negative with respect to the experimental hypothesis. (c) The investigators were not using the acceptable procedure of excluding data by means of a criterion that was determined prior to inspection of the data. (d) The way the data were analyzed did not allow for the possibility that the study may have simply failed to show an Experimenter Unintentional Expectancy Effect. In another connection, Chapanis and Chapanis (1964) presented several reasons why these kinds of statistical procedures lead to misleading conclusions:
Unfortunately, this line of reasoning [that data which are counter to the hypothesis can be excluded from the analysis which aims to test the hypothesis] contains one fundamental flaw: it does not allow the possibility that the null hypothesis may be correct. The [investigator], in effect, is asserting that his ... prediction is correct and that Ss who do not conform to the prediction should be excluded from the analysis. This is a foolproof method of guaranteeing positive results.
Some people may feel that no matter how questionable the selection procedure, it must still mean something if it leads to significant results. This point of view, however, cannot be reconciled with the following facts of life: it is always possible to obtain a significant difference between two columns of figures in a table of random numbers provided we use the appropriate scheme for rejecting certain of these numbers ...
We strongly recommend that Ss not be discarded from the sample after data collection and inspection of the results. Nor is it methodologically sound to reject Ss whose results do not conform to the prediction ... If there are any theoretical grounds for suspecting that some Ss will not show the predicted ... effect, the characteristics of such Ss, or the conditions, should be specifiable in advance. It should then be possible to do an analysis on all Ss by dividing them into two groups, those predicted to show [the effect] and those predicted not to show it. (pp. 16-17)
In brief, no confidence can be placed in research reports that conclude that the hypothesis was confirmed by a statistical analysis which excluded the data that were judged, after inspection of the results, to be significantly opposite to the hypothesis. When the data of the study are analyzed appropriately using all 20 experimenters, there is no significant difference between the ratings obtained when the experimenters expected high ratings and when they expected low ratings.
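The point made by Chapanis and Chapanis about random numbers can be sketched with a small simulation. The Python fragment below does not use the data of the study discussed above; it generates pure noise. Each simulated "study" has 20 experimenters whose difference scores (high-expectancy mean minus low-expectancy mean) are drawn from a normal distribution centered on zero, so the null hypothesis is true by construction. The scores are tested once with all 20 experimenters and once after discarding the 3 scores most opposed to the hypothesis. Dropping the 3 lowest scores is a simplifying assumption standing in for exclusion after inspection of the results, and the function names, the seed, and the number of simulated studies are likewise hypothetical.

```python
import random
import statistics

# Two-tailed .05 critical values of Student's t for the two sample sizes used
# below (df = 19 for 20 scores, df = 16 for 17 scores). Other sample sizes
# would need their own table entries.
T_CRIT = {19: 2.093, 16: 2.120}


def one_sample_t(scores):
    """t statistic for the hypothesis that the mean score is zero."""
    n = len(scores)
    return statistics.mean(scores) / (statistics.stdev(scores) / n ** 0.5)


def simulate(n_studies=2000, n_experimenters=20, n_dropped=3, seed=1):
    """Return the share of null studies called significant, before and after exclusion."""
    rng = random.Random(seed)
    hits_all = 0
    hits_trimmed = 0
    for _ in range(n_studies):
        # The null hypothesis is true: difference scores are pure noise.
        diffs = [rng.gauss(0.0, 1.0) for _ in range(n_experimenters)]
        if abs(one_sample_t(diffs)) > T_CRIT[len(diffs) - 1]:
            hits_all += 1
        # Post hoc exclusion: discard the scores most opposed to the hypothesis.
        kept = sorted(diffs)[n_dropped:]
        if abs(one_sample_t(kept)) > T_CRIT[len(kept) - 1]:
            hits_trimmed += 1
    return hits_all / n_studies, hits_trimmed / n_studies


rate_all, rate_trimmed = simulate()
print(f"'significant' with all 20 experimenters:  {rate_all:.3f}")
print(f"'significant' after dropping 3 post hoc:  {rate_trimmed:.3f}")
```

Under these assumptions the first rate should stay near the nominal .05, while the second should be noticeably higher, even though no effect exists in the simulated data.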
Illustrative Study 2
We shall further illustrate the Investigator Data Analysis Effect by looking closely at a widely quoted study (Rosenthal, Persinger, Vikan-Kline, & Mulry, 1963) which aimed to demonstrate that (a) experimenters unintentionally communicate their expectancies to their subjects, (b) the subjects then respond in accordance with the experimenters' expectancies, (c) experimenters also unintentionally communicate their expectancies to their assistants, and (d) when the assistants henceforth test subjects, they also unintentionally obtain data in line with their expectancies.
There were 14 experimenters in the study who tested 76 subjects on the person-perception task. Each experimenter later trained two assistants and the 28 assistants then tested 154 additional subjects on the same task. The three major independent variables were as follows:
1. The experimenters first stated the average ratings they expected to obtain from their subjects on the person-perception task. These expectancies were termed the experimenters' idiosyncratic expectancies or biases.
2. Before testing their subjects, half of the experimenters were told (by the investigator) that they should expect to obtain high ("success") ratings from their subjects and half were told that they should expect to obtain low ("failure") ratings and they were given a rationale why they should expect to obtain such ratings. These expectations were labeled as induced expectancies.
3. Subsequently, the experimenters were led to expect (by the investigator) that their assistants would obtain the same high (or low) ratings from their subjects that they (the experimenters) had been originally led to expect. However, the investigator warned the experimenters not to tell their assistants what type of ratings they should expect. After the assistants had tested their subjects, analyses were performed to determine if the assistants obtained ratings which were in line with either the idiosyncratic or induced expectancies of the experimenters who had originally trained them.
The major dependent variables in this study were the ratings on the person-perception task given by the subjects tested by the experimenters and by the subjects tested by the assistants. The authors of the paper reported that "The 2 X 2 analysis of variance, based on Es' Ss' ratings as a function of Es' induced and idiosyncratic [expectancy] bias yielded no F with an associated p < .15" [p. 321]. This statement means that the Experimenter Unintentional Expectancy Effect was not demonstrated in this study - the experimenters did not obtain ratings from their subjects which agreed with what the experimenters originally expected to obtain (idiosyncratic expectancies) or what the investigator told them to expect (induced expectancies). In addition, the authors of this paper presented an analysis of variance which indicated that the assistants were not significantly influenced by their experimenters to obtain ratings in accord with the experimenters' idiosyncratic or induced expectancies.
The conclusion indicated by the analyses mentioned above is that the study had not demonstrated an Experimenter Unintentional Expectancy Effect and also had not demonstrated an Assistant Unintentional Expectancy Effect - that is, neither the experimenters' nor the assistants' expectancies had influenced their subjects' ratings on the person-perception task. The investigators (Rosenthal et al., 1963), however, failed to draw the conclusion that the study had failed to demonstrate expectancy effects. Instead, they went on to perform additional statistical analyses among the variables mentioned above and many other variables. We shall now summarize Barber and Silver's (1968a, 1968b) critique of these additional analyses in order to illustrate several types of Investigator Data Analysis Effects.
1. These additional statistical analyses took into consideration at least 22 independent variables and 4 dependent variables. The 22 independent variables included, for example, the experimenters' and the assistants' idiosyncratic and induced expectancies and a variety of personality characteristics of the experimenters, the assistants, and the subjects. The 4 dependent variables included two methods for measuring the effects of the experimenters' expectancies and two methods for measuring the effects of the assistants' expectancies.
2. When a study includes many independent and many dependent variables, the investigator should perform a multivariate analysis (e.g., factor analysis, multiple-discriminant analysis, canonical correlation, multivariate analysis of variance or covariance). A multivariate analysis applied to multivariate data can yield unambiguous conclusions concerning the effects of the independent variables on the dependent variables. However, if a multivariate analysis is not performed - if the investigator analyzes the data bit-by-bit (for example, analyzes the effects of one or more independent variables separately on each dependent variable) - serious problems of probability pyramiding arise (Neher, 1967).
3. The investigators (Rosenthal et al., 1963) did not perform a multivariate analysis and, in fact, could not do so because such an analysis requires a very large number of subjects.
4. Although the 22 independent variables and the 4 dependent variables could give rise to thousands of possible statistical comparisons (Guilford, 1954, p. 80), the investigators analyzed only a small fraction of the data bit-by-bit, utilizing primarily t tests and Spearman rhos. The investigators performed about 125 statistical comparisons of which about 21 were significant at the .05 level.
5. If the investigators had made planned comparisons (that is, if they had specified in advance which comparisons were to be made), and if each comparison was independent of the others, one would expect about 6 of the 125 to be "significant" at the .05 level by chance alone.
6. However, since the investigators were not making planned comparisons and since at least 5 percent of the thousands of possible comparisons that were present in the data could be expected to be "significant" at the .05 level by chance alone, it is difficult to determine exactly how many of the 125 comparisons (that were selected by unclear criteria from the thousands of possible comparisons) might be "significant" by chance alone but this number could easily have exceeded 21 (cf., Hays, 1963, Chap. 14).
7. Furthermore, of the 21 comparisons that were found to be nominally "significant," about 15 involved overlapping data. When separate statistical tests are made on overlapping sets of data, it can be expected that if one set is significant the other set (which includes much of the same data) may also be significant and the two statistics cannot be considered as independent of each other. When the statistical comparisons are not independent, the percentage of comparisons that can be expected to be significant at the .05 level by chance alone far exceeds 5 percent.
8. Of the remaining 6 nominally significant statistics, 5 were t tests that were performed upon data which had been first tested for significance by overall F tests. The F tests failed to show significant effects and the null hypothesis should have been accepted. When the preliminary analysis of variance does not show overall significance, postmortem analyses of the same data by means of t tests yield uninterpretable results (Hays, 1963, p. 483).
In brief, although this study (Rosenthal et al., 1963) was interpreted as showing an Experimenter Unintentional Expectancy Effect and also an Assistant Unintentional Expectancy Effect, the interpretation was not valid. The study was inappropriately analyzed and the investigators drew conclusions that were not justified by the data - that is, the conclusions were based on an Investigator Data Analysis Effect.
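The chance expectations cited in the critique above follow from one multiplication, which the Python sketch below spells out. The figures 125 and 21 are taken from the text. The pool sizes of 1,000, 2,000, and 5,000 are hypothetical stand-ins for the "thousands of possible comparisons", since the exact number is not given, and the function name is invented for the example. The calculation assumes that every null hypothesis is true.

```python
ALPHA = 0.05


def expected_chance_hits(n_comparisons, alpha=ALPHA):
    """Expected number of nominally significant results when every null hypothesis is true."""
    return alpha * n_comparisons


reported = 125     # comparisons reported in the study, per the text
significant = 21   # of which this many reached the .05 level, per the text

print(f"expected by chance among {reported} planned comparisons: "
      f"{expected_chance_hits(reported):.2f}")

# Hypothetical pool sizes; the text says only "thousands".
for pool in (1000, 2000, 5000):
    hits = expected_chance_hits(pool)
    print(f"pool of {pool}: about {hits:.0f} chance 'findings' to select from "
          f"(more than {significant}: {hits > significant})")
```

Under these assumptions, even the smallest hypothetical pool contains more chance "findings" than the 21 that were reported, which is why comparisons selected after the fact cannot be judged against the expectation for 125 planned comparisons.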
SUMMARY
Let us now summarize some of the Investigator Data Analysis Effects that were found in the two illustrative studies described above.
In the first study, (a) an overall statistical analysis was not performed to reject the null hypothesis, (b) negative data (data which were significantly opposite to the experimental hypothesis) were not used in the statistical analysis which supposedly confirmed the experimental hypothesis, and (c) the decision not to use the negative data was made after inspection of the results and without a predetermined rationale.
Some of the pitfalls in the second study were as follows: (a) After an overall analysis had failed to reject the null hypothesis at a conventional level of significance, the investigators performed a large number of postmortem statistical tests on the data. The investigators failed to make clear that the results of such postmortem analyses are far from definitive and can, at best, only suggest new hypotheses to be validated in further research. (b) Problems of probability pyramiding were not avoided (Neher, 1967); for example, there was a failure to take account of changing levels of significance when many statistical tests were performed on a single set of data (Feild & Armenakis, 1974; Ryan, 1959). (c) The investigators strained for significance by accepting p values greater than .10 as confirming the experimental hypothesis. (d) The investigators failed to perform a multivariate statistical analysis, such as multiple analysis of variance, in a study which included many independent and many dependent variables. Instead, a large number of comparisons were made on overlapping data by individual t tests and Spearman rhos.
RECOMMENDATIONS
The above considerations suggest that an investigator can avoid an Investigator Data Analysis Effect by adhering to the following principles:
1. If the investigator is not using the technique of planned comparisons - that is, if the particular comparisons that are to be made are not specified in advance (Hays, 1963, Chap. 14) - an overall statistical test should be performed that includes all of the data.
2. The probability value required for rejection of the null hypothesis should be specified in advance.
3. Conclusions should not be drawn from the results of postmortem tests performed upon the data after an overall test has failed to reject the null hypothesis. The results of such postmortem tests should be "substantiated in independent research in which they are specifically predicted and tested" (Kerlinger, 1964, p. 621).
4. The statistical analyses should avoid errors of probability pyramiding (Feild & Armenakis, 1974; Neher, 1967), for example, the error of "finding some significant F ratios in an experiment by complicating the experiment with more and more irrelevant variables, while continuing to base the error rate upon the individual F" (Ryan, 1959).
5. If many independent and many dependent variables are used in one study, they should be clearly specified beforehand, a large number of subjects should be used (so that there are sufficient numbers of subjects in each cell of the experimental design), and the data should be analyzed by multivariate procedures such as multiple discriminant analysis, multivariate analysis of variance or covariance, canonical correlation, or factor analysis. The analysis of multivariate studies should not be carried out piecemeal by individual t tests, Spearman rhos, chi-squares, etc. (Cattell, 1966).
6. Instead of including many independent and dependent variables in a study (which requires a large number of subjects if the investigator is to carry out an appropriate multivariate analysis), the investigator might consider the advantages of keeping the number of variables within manageable proportions. As Hays (1963) cogently pointed out:
In planning an experiment, it is a temptation to throw in many experimental treatments, especially if the data are inexpensive and the experimenter is adventuresome. However, this is not always good policy if the psychologist is interested in finding meaning in his results; other things being equal, the simpler the psychological experiment the better will be its execution, and the more likely will one be able to decide what actually happened and what the results actually mean. (p. 411)
MOTIVATIONS FOR POSITIVE RESULTS
To reduce the extent of the Investigator Data Analysis Effect, it is necessary to emphasize in the training of behavioral scientists how and why this effect exists and how investigators should take pains to avoid it. Of course, this depends primarily on our teachers in psychology and the behavioral sciences and their willingness to talk about this effect openly and to continuously caution their students about it.
Although the Investigator Data Analysis Effect can be reduced by bringing it out from behind closed doors and talking about it openly, nevertheless, there are strong motivations that tend to give rise to this effect and, as long as they exist, we can expect the effect to occur. To further reduce the prevalence of this effect it is necessary to remove the motivations. One of these motivations derives from the belief that investigators will not be able to publish their research in professional journals if they do not report positive results. Let us look more closely at the complexities involved in publishing papers.
McNemar (1960) conjectured that studies with non-significant results are usually not submitted for publication; investigators commonly select their significant findings for inclusion in their reports. This conjecture was confirmed by Sterling (1959) when he surveyed all of the papers published during 1955 in four major psychological journals. In 97 percent of the studies that used statistical tests, the null hypothesis was rejected, that is, "positive" findings were reported. More recently, Bozarth and Roberts (1972) checked all of the articles published from January 1967 to August 1970 in three journals concerned with counseling psychology, and Greenwald (1975) checked the articles published during 1972 in the Journal of Personality and Social Psychology. Bozarth and Roberts reported that, of the studies using statistical tests, 94 percent rejected the null hypothesis, and Greenwald found that 88 percent of the articles reported positive results. It thus appears that nonsignificant results are rarely submitted for publication, rarely accepted for publication, or both.
The implication of the above, that there exists a misleading selection of "significant" results for publication in journals, is supported by Cohen's (1962) analysis of papers published during 1960 in the Journal of Abnormal and Social Psychology. He analyzed 70 studies with regard to the power of the statistical tests that were used. (The power of a test increases with the size of the sample.) He found that the power of the tests, that is, the probability of rejecting the null hypothesis of no difference when there actually was a difference, was typically meager. That is, the samples were typically too small to expect that the statistics would yield significant results very often even when the null hypothesis was false. However, with few exceptions, each of the 70 studies reported "significant" results even though the statistical tests were usually not sufficiently powerful to detect "significance" with the relatively small samples that were used. The results of Cohen's analysis can be interpreted as indicating either that (a) investigators "find" significance in their data even though their statistical tests are not sufficiently powerful to detect "significance" with the typically small samples that are used, (b) they select only their significant findings for publication and do not submit their negative findings, or (c) journal editors select the significant findings for publication and reject the negative findings.
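The dependence of power on sample size can be sketched numerically. The following is a minimal illustration, not Cohen's procedure: it assumes a two-sided comparison of two group means, uses the normal approximation (population standard deviation treated as known), and takes a hypothetical true difference of half a standard deviation. The function name and all numerical values are assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def approx_power(effect_size: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided, two-group comparison of means.

    effect_size is the true mean difference in standard-deviation units;
    the normal approximation treats the population SD as known.
    """
    z_crit = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    shift = effect_size * sqrt(n_per_group / 2.0)
    # Probability that the test statistic lands in either rejection region.
    return STD_NORMAL.cdf(shift - z_crit) + STD_NORMAL.cdf(-shift - z_crit)


# Hypothetical sample sizes; the effect size of 0.5 SD is also an assumption.
for n in (10, 20, 40, 64, 100):
    print(n, round(approx_power(0.5, n), 3))
```

With small groups the approximate power stays well below one-half under these assumptions, so a literature in which nearly every small-sample study reports "significant" results is difficult to explain without some form of selection.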
In line with the above, Smart (1964) noted that there appear to be two main reasons why studies with negative results are rarely published: (a) Authors are more likely to submit their positive rather than their negative results for publication. (b) Negative results are subjected to more editorial scrutiny. Support for the latter contention is found in a recent investigation (Mahoney, 1975), and also in an editorial statement made by a former editor of the prestigious Journal of Experimental Psychology. The editor (A.W. Melton) stated that he was very reluctant to publish results that were not significant at the .01 level (Bakan, 1967). Smart (1964) noted the following problems that arise from these practices: (a) If researchers are aware of studies supporting a hypothesis but not those which did not support it, they are misled into believing that the hypothesis is more valid than is actually the case. (b) Without an awareness of negative results in an area, other investigators are unable to make improvements in their experimental designs which might lead to positive results. Another problem, which is not mentioned by Smart, is that the emphasis on positive results may lead investigators to perform inappropriate data analyses so as to obtain "positive" results.
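The first problem noted by Smart can be illustrated with a small simulation. This is a sketch under stated assumptions rather than a model of any actual literature: the true effect (0.3 standard deviations), the group size (20), and the number of studies (2,000) are all hypothetical, each study's observed mean difference is drawn directly from its normal sampling distribution, and a study is "published" only if it reaches the .05 level.

```python
import random
from math import sqrt
from statistics import NormalDist, mean

STD_NORMAL = NormalDist()


def simulate_literature(true_effect, n_per_group, n_studies, alpha=0.05, seed=0):
    """Simulate many two-group studies and 'publish' only the significant ones.

    Each observed mean difference (in SD units) is drawn from its sampling
    distribution, N(true_effect, sqrt(2 / n_per_group)), a simplification
    that treats the population SD as known.
    """
    rng = random.Random(seed)
    se = sqrt(2.0 / n_per_group)
    z_crit = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    all_effects = [rng.gauss(true_effect, se) for _ in range(n_studies)]
    # Selection step: only studies with a significant two-sided test survive.
    published = [d for d in all_effects if abs(d) / se > z_crit]
    return all_effects, published


all_effects, published = simulate_literature(true_effect=0.3, n_per_group=20, n_studies=2000)
print("share of studies reaching significance:", round(len(published) / len(all_effects), 3))
print("mean effect, all studies:", round(mean(all_effects), 3))
print("mean effect, published only:", round(mean(published), 3))
```

In this sketch every published study is "positive" by construction, and the average published effect exceeds the average across all studies, so a reader who sees only the published studies would judge the hypothesis to be better supported than the full set of studies warrants.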
The notion that journal editors tend to reject reports of negative results is true in some cases (see Melton's statement above), but this notion is also misleading. It is not negative results per se that are difficult to publish but results that are judged to be meaningless, trivial, or as failing to enhance understanding. Both negative results and positive results can fail to contribute to knowledge or theory or be meaningless or trivial. As stated in the Publication Manual of the American Psychological Association (Anonymous, 1974, p. 22), positive results with regard to a trivial question "or devoid of theoretical explanation" are practically as valueless as negative results with regard to the same question. Good research answers meaningful questions and a meaningful question can be answered either by Yes (positive results) or No (negative results). For example, in the area of hypnotism my co-workers and I have asked, Is a standardized hypnotic induction procedure more effective than brief task motivational instructions in enhancing responsiveness to test suggestions as measured by the Barber Suggestibility Scale? Although a series of experiments provided a "No" answer to this question (negative results), they were all readily publishable because they answered an important question (Barber, 1969a). As the Publication Manual of the APA (Anonymous, 1974) states, negative results are of interest to editors (a) "when an established theory clearly predicts that a difference or correlation should be found" and also (b) "when an investigator discovers a methodological weakness in a published report of positive results and, correcting the weakness, finds that the significances vanish" [p. 21].
In brief, the major problem is not in the results but in the questions that are to be answered. If investigators asked meaningful questions, the answers to the questions would themselves be meaningful regardless of whether the answer is Yes (positive results) or No (negative results). However, the notion that negative results are difficult to publish has a basis of truth. Not all negative results but certain types of negative results are difficult to publish. The type of negative results that are difficult to publish are specified as follows by the Publication Manual of the APA (Anonymous, 1974): "Failure to replicate results of a previous investigator, using the same method but a different sample, is generally of questionable value. A single failure may merely testify to sampling errors or to the conclusion that one of the two samples had unique characteristics responsible for the reported effect, or the lack of effect. An author can resolve the issue when he reports several failures with a range of samples. A single failure is too equivocal to justify publication on its merit alone" (pp. 21-22).
The above is related to a more general problem. When no relationship is found between an independent and a dependent variable, there are many reasons, in addition to the statistical point that we can never prove the null hypothesis, why we cannot conclude that there is actually no relationship: for instance, the independent variable may not have been successfully manipulated or the measure of the dependent variable may have been inadequate (Mills, 1969).
Investigators could consider the problems associated with negative results within the following perspective: (a) Many questions, when answered, contribute to knowledge regardless of whether the answer is Yes (positive results) or No (negative results). (b) Many questions can be worded in a way that avoids the problem of negative results. For instance, instead of testing a null hypothesis, such as "Hypnosis is no more effective than task motivational instructions in enhancing response to suggestions," the question can be worded in such a way that either a Yes or a No answer is equally enlightening, for example, "Is a procedure that includes many components (a hypnotic induction procedure) more effective than one of its components (task motivational instructions) in raising suggestibility?"
After discussing problems similar to those delineated above, Greenwald (1975) came to similar conclusions: "1. Do research in which any outcome (including a null one) can be an acceptable and informative outcome. 2. Judge your own (or others') research not on the basis of the results but only on the basis of adequacy of procedures and importance of findings" (p. 19).
NOTES
1. The journals were Journal of Comparative and Physiological Psychology, Journal of Personality and Social Psychology, Journal of Verbal Learning and Verbal Behavior, and Journal of Educational Psychology.
2. Lykken (1968) and also Minturn (1971) have presented an additional tongue-in-cheek solution to the related pitfall of confusing statistical significance with the potential replicability of results. They noted that the confidence that an investigator actually has in his findings may differ from the p values that he reports, because the investigator is aware of far more about his research than is reflected in his p values. Consequently, Lykken (1968) and Minturn (1971) proposed the "test of the gambler's challenge" or a "Wagers" section in journals where the author bets a certain amount of money on the repeatability of his results.
3. Most of the studies conducted prior to 1968 which were interpreted as demonstrating an Experimenter Unintentional Expectancy Effect (Rosenthal, 1966, 1968) did not actually show this effect; instead, many of the studies seemed to show an Investigator Data Analysis Effect (Barber, 1969b; Barber & Silver, 1968a, 1968b). A statistical analysis of 12 additional studies, which was interpreted as showing an Experimenter Modeling Effect (Rosenthal, 1966), was also inappropriate, that is, it also showed an Investigator Data Analysis Effect (Silver, 1968). Additional Investigator Data Analysis Effects have been delineated by Elashoff and Snow (1971) in their detailed critique of the kinds of data analyses that were used in the famous and influential book entitled Pygmalion in the Classroom: Teacher Expectation and Pupils' Intellectual Development (Rosenthal & Jacobson, 1968).
4. The journals were Journal of Experimental Psychology, Journal of Comparative and Physiological Psychology, Journal of Clinical Psychology, and Journal of Social Psychology.
5. The journals were Personnel and Guidance Journal, Journal of Consulting and Clinical Psychology, and Journal of Counseling Psychology.
6. In Sterling's (1959) survey, none of the 362 journal reports were replications of previous studies and, in the Bozarth and Roberts (1972) survey, less than 1 percent of the articles were replications of previous studies.
7. Greenwald (1975) also noted that there are several commonly accepted notions about the null hypothesis and about negative results that are misleading. One such notion is that since one cannot prove the null hypothesis, therefore, no conclusions can be drawn from negative results. Greenwald pointed out the misleading features of this contention as follows:
The notion that you cannot prove the null hypothesis is true in the same sense that it is also true that you cannot prove any exact (or point) hypothesis. However, there is no reason for believing that an estimate of some parameter that is near a zero point is less valid than an estimate that is significantly different from zero. Currently available Bayesian techniques (e.g., Phillips, 1973) allow methods of describing acceptability of null hypothesis. (p. 2)
Greenwald (1975) next considered the argument that science advances by discovering relations between variables, that is, by rejecting the null hypothesis. He noted that "This argument ignores the fact that scientific advance is often most powerfully achieved by rejecting theories (cf., Platt, 1964). A major strategy for doing this is to demonstrate that relationships predicted by a theory are not obtained, and this would often require acceptance of a null hypothesis" (p. 2). After presenting a series of additional cogent arguments, Greenwald (1975) concluded that, "Support for the null hypothesis must be regarded as a research outcome that is as acceptable as any other" (p. 16).
For the sake of completeness, it is necessary to discuss a taboo topic - the Investigator Fudging Effect. This effect is present when an investigator intentionally reports results that are not the results he actually obtained. In this chapter, I shall first summarize some of the relevant data pertaining to the Investigator Fudging Effect and then I shall discuss the motivations and countermotivations for fudging.
SOME INSTANCES OF FUDGING
Newton, Dalton, Mendel
Although outright fraud (fudging of all or most of the data) is probably very rare in the behavioral sciences, "pushing the data", or letting desires and biases influence the way the data are analyzed or reported, may not be too rare. For instance, if an investigator finds that the statistical test of his hypothesis is approaching significance at, say, p = .15, he may fudge the p value by changing it to p = .05. This type of fudging has been noted by many students of scientific history. For instance, after discussing instances of outright fraud in science, Merton (1957) noted that probably much more common are instances of "trimming" or "cooking" the data which are probably due to excessive concern with success in scientific work.
It appears that even Isaac Newton indulged in "small-scale" fudging to make his data appear more precise than they actually were. Westfall (1973) presented a series of examples in which Newton's measurements matched his theoretical predictions to a degree of accuracy that was impossible at that time. Westfall commented as follows:
And having proposed exact correlation as the criterion of truth, [Newton in his Principia] took care to see that exact correlation was presented, whether or not it was properly achieved. Not the least part of the Principia's persuasiveness was its deliberate pretense to a degree of precision quite beyond its legitimate claim. If the Principia established the quantitative pattern of modern science, it equally suggested a less sublime truth - that no one can manipulate the fudge factor quite so effectively as the master mathematician himself. (pp. 751-752)
Along similar lines, it appears that Dalton (or possibly his assistants) may have fudged some of his data on chemical atomism (Brush, 1974). Also, it appears that Mendel (or possibly his assistants) fudged some of his data on genetics (Brush, 1974). Relevant here is the demonstration by Ronald Fisher, the noted statistician, that the data in Mendel's original paper on heredity could not have been true "because it was inconceivable, short of an 'absolute miracle of chance', to obtain these ratios" (Koestler, 1971, p. 56). Questions pertaining to fudging of data have also arisen more recently in the sciences. Let us look at a representative case.
The Summerlin Case
William T. Summerlin, a scientist at the Sloan-Kettering Institute for Cancer Research, recently admitted fudging data in a very important investigation. This investigator had reported a series of studies indicating that, when skin and other organs are maintained for a time in tissue culture, they lose their ability to provoke an immune response. The important implication of these reports was that organs could be transplanted between genetically non-related individuals without the organ being rejected. Summerlin admitted that he had fudged data that he presented to the head of the institute. Specifically, he had painted the skin of two mice to make it falsely appear that they had been successfully grafted. Summerlin was also charged with irresponsible conduct by an investigating committee for misrepresenting other experiments which supposedly indicated successful transplants of human corneas (Culliton, 1974).
Faber (1974) cogently commented on the implications of the Summerlin case as follows:
We are naive to believe that dishonesty in research is unique and aberrant. The rewards are just too tempting: prestige, ego enhancement, promotion, and, as in the case of Summerlin, a $40,000 salary and a home in Darien, Connecticut. Mighty tempting rewards for success. Not only are the rewards tempting but, while the process of socialization in graduate school may give credence to veracity, it nonetheless emphasizes success. The emphasis on scientific success creates a severe strain on the practicing researcher, who is torn between the norms established for the process of research and the penultimate rewards for success. Under these conditions deviance is likely to occur in any group, even among scientists. (p. 734)
Parapsychology
The problem of deception or fudging of the data has been especially critical for the area of parapsychology. Over the years, researchers in parapsychology have instituted a wide variety of controls which have met the criticisms that have been leveled at the field. However, when all other criticisms have been answered, there still remains one criticism that has prevented some scientists from fully accepting the findings from parapsychology. This criticism is simply that parapsychological researchers can produce significant ESP results by fudging a very small part of their data (Hansel, 1966; Price, 1955). Since scientists are aware that fudging a small part of the data to make the study "come out right" or to obtain statistically significant results is not too uncommon among their own associates, they can easily see this happening also in parapsychological research. Although scientists are aware that "small-scale" fudging occurs among their own colleagues, there are two reasons why they view this kind of fudging as more serious in parapsychology: (a) They believe ESP is inherently improbable or highly unlikely whereas they believe their own field is well-established. (b) They believe that, in their own field, replication of studies by independent laboratories determines the validity of original findings whereas, in the area of parapsychology, cross-validation by independent laboratories is very rare.
J.B. Rhine, who has been at the forefront of research in parapsychology for many years, has once again demonstrated his role as a leader and ground-breaker by facing the issue of deception and fudging without equivocation. In three recent important articles (Rhine, 1973, 1974a, 1974b), he discussed in detail how serious workers in parapsychology have been plagued for many years by the problem of fudging, especially among the new workers in parapsychology.
In his first paper, Rhine (1973) presented several cases of new students who came to the Duke Parapsychology Laboratory and who seemed headed for a career in parapsychology but whose reporting of results was not reliable, which they admitted when faced with the evidence. These students were advised to "seek a career in a less sensitive field" and they concurred.
Rhine (1973) then went on to discuss cases of individuals who had obtained the doctorate in an established field, and who had already reported successful research on ESP. However, the reports presented by individuals in this group at times did not spell out details regarding the usual safeguarding conditions for ESP research. When asked to specify the safeguarding conditions or to improve them, some of the individuals did so and became established researchers but most of the individuals in this category did not do so; in fact, they lost interest in ESP research and did not publish further papers on the topic. In some of these cases "Evidence of altered records led to the suggestion of dishonesty, and when [the individual] was confronted with the evidence, he quickly and quietly dropped all contact with parapsychology" (p. 364).
Rhine (1973) commented on these cases as follows:
Yet one must wonder why such a weak and stupid course would be followed, even if rarely, by mature, intelligent, educated individuals already established in much more secure professions than parapsychology. Obviously these [individuals] were not among the strongest and most successful members of their own disciplines. Also, in their superficial view of psi research they probably received a false conception of how easy it would be to gain quick notoriety and advancement in it. The field is of course open to anyone, with almost no checks and balances until a report is submitted. The more accepted rules and standards of psi research are not much in evidence especially to outsiders, and other fields do not raise the strict questions regarding significance, controls, and confirmation that parapsychology editors do. So a would-be experimenter who is of course new to the strictures of this branch of science would naturally be quite surprised if his test results should not reveal the psi effects he had anticipated. Since he knows that positive results were supposed to be there, he might be tempted to 'top-off' the data to round out the expected result which he has been led to assume. After all, he may easily suspect that this is the way data are 'topped off' and 'rounded out' in many other professions and disciplines, that nobody will be the worse for it, and that plenty of others are probably doing it.
Other pressures too may support him in his attitude. [He] may strongly wish to have a paper accepted for publication or for a convention program, either to help his status or his vanity, or both. The worldwide public attention which has been given to parapsychology has admittedly been the envy of some of the people in other less popular fields. Odd as it may seem, an almost fanatic urge to share in this sort of fame takes hold of some individuals. (pp. 364-365)
In his second paper, Rhine (1974b) discussed a dozen cases "to illustrate fairly typically the problem of experimenter unreliability prevalent in the 1940's and 1950's" (p. 104). With regard to these twelve individuals, "four of them were caught 'red-handed' in having falsified their results; four others did not contest (i.e., tacitly admitted) the implications that something was wrong with their reports that seemed hard to explain and they did not try. In the case of the remaining four the evidence was more circumstantial, but it seemed to our staff they were in much the same doubtful category as the other eight" (p. 104).
Rhine (1974b) stated that during the past 20 years there has been a marked reduction in this type of chicanery, primarily because such risky personnel have been avoided and because steps have been taken to make it very difficult for "dishonesty to be implemented inside the well-organized psi laboratory today" (p. 105). Despite these precautions, a case of dishonesty was discovered soon after Rhine wrote the above words in March, 1974. In June, 1974, Rhine (1974a) wrote as follows in his third paper:
When I wrote my paper on deception for the March issue of the Journal [of Parapsychology] I had not expected to come back to the subject again in publication. I thought experimental parapsychology was heading into a stage of successful avoidance of the problems of experimenter dishonesty. Accordingly, I was shocked to discover, only a few months later, a clear example of this same problem, not only right here at the Institute for Parapsychology, but even involving an able and respected colleague and trusted friend. (p. 215)
In this case, a suspicious research assistant concealed himself during the experiment and observed the behavior of Walter J. Levy, Jr., who was a major investigator in parapsychology. While he was secretly observed, Levy improperly altered the data. When Rhine confronted Levy with the observations of the research assistant, Levy acknowledged that he had fudged the data of the experiment and resigned from the staff of the Institute for Parapsychology. Levy stated that he had fudged the data of the experiment he was conducting because, contrary to his previous experiments, the results were at a chance level and he wanted to bring the results up to a significant level so that others would be stimulated to replicate his previous significant studies. Rhine (1974a) concluded from this case of fudging, along the lines of his earlier articles (Rhine, 1973, 1974b), that (a) "the necessity of trusting the experimenter's personal accuracy or honesty must be avoided as far as possible," (b) a method that can help avoid reliance on the investigator's honesty is to involve a number of investigators in each study, and (c) "each new experiment must be considered in effect only a pilot project until it is eventually repeated by others; and if an important finding is at stake, the more repetitions, the better" (p. 220).
Other Recent Cases
Although outright fraud (fudging of all the data), of the kind that anthropologists discovered with regard to Piltdown man (Jastrow, 1935; MacDougall, 1958; Tullock, 1966), appears to be very rare in scientific research, every once in a while a case is reported of scientists who were caught fudging some of their data. For instance, not too long ago, papers that were published in prestigeful journals (Science and Journal of Infectious Diseases) were shown to contain fraudulent data (DuShane, Krauskopf, Lerner, Morse, Steinbach, Strauss, & Tatum, 1961).
Fudging of data was also recently demonstrated among physician-researchers who were paid by pharmaceutical companies to evaluate the effectiveness of new drugs. In 1967 a committee from the Food and Drug Administration investigated the validity of the physicians' reports. About one-third of the physicians who were investigated (16 of 50) were found to have supplied fabricated data on the new drugs to the sponsoring drug companies and to the government (N.W., 1973).
Serious questions have also been raised recently with regard to Sir Cyril Burt's results which have been used to support the genetic viewpoint in the recent controversy pertaining to inheritance of intelligence. Kamin (1973) made some surprising discoveries when he looked closely at Burt's data pertaining to IQ in monozygotic twins reared apart. In 1955 Burt reported that he had tested 21 sets of twins, in 1958 he reported that he had tested over 30, and in 1966 he reported that the number tested had reached 53. In each of these three papers, Burt reported that the correlation between the IQs of the twins was .771. It appears almost certain that some part of the data that were reported over the years was incorrect. Either Burt's sample did not increase as he reported or the correlation between the IQs did not remain perfectly constant (to three decimal places) with an increasing sample from 21, to 30, to 53 pairs of twins. The probability of obtaining three identical correlations is so astronomically small that it seems appropriate to conclude that Burt was either extremely careless in reporting his data or misreported them.
Relevant to the above is the evidence indicating that "cheating" is the norm in a variety of situations and that honesty is present only when individuals are clearly aware that the odds are high that they will be discovered and punished for dishonesty. For instance, in a study with three sociology classes, Tittle and Rowe (1973) ascertained how many students cheated when they were allowed to score their own examinations. They found that only 5 of 107 students totally refrained from cheating during the entire quarter and they concluded that "conformity to the norm of honesty in the classroom situation is unlikely in the absence of control efforts by the instructor" (p. 496). We may deduce from these findings that stronger sanctions are necessary in education and also in science to prevent dishonesty. Let us now turn to some of the motivations that may underlie dishonest reporting of data.
MOTIVATIONS FOR FUDGING
As Reif (1961) pointed out, there is often intense competition among investigators deriving from a variety of factors which cause them to strive for prestige. Investigators commonly invest much time and effort in their research and they are not always neutral with respect to the results they obtain. Some investigators prefer that the results come out a certain way. Beck (1961) noted that, since investigators usually have a vested interest in the successful outcome of their research and feel the pressure to succeed or to blaze new trails, such biases generate error "and - let's face it, since science is done behind closed doors - dishonesty" (p. 219). After noting cases of fraud in science, Beck noted that:
What dishonesty exists among scientists is rarely on such a grand scale. It is subtle and, no doubt, frequently unconscious behavior. The experiments that 'work' are reported with no mention of those that failed. The data that support the hypothesis are seized upon; the rest are explained away or forgotten. (p. 220)
Hagstrom (1965) presented evidence that scientists are motivated to receive recognition, that this motivation influences the types of problems they tackle, and that scientists deny this motivation. Along similar lines, Glaser (1964) noted that the structure of science, which gives rise to competition for recognition, commonly gives rise to feelings of "comparative failure." Scientists who feel they have not received sufficient recognition may indulge in deviant practices, such as "falsifying, plagiarizing, 'trimming off' bits of inconvenient data, selecting only those data that support one's hypothesis, and reporting only successful results" (Glaser, 1964, pp. 99-100).
The strong drive for recognition and fame among scientists was thoroughly documented by Watson (1968) in his book, which also described how the structure of DNA was discovered. The drive for fame has been present from the very beginning of science. Merton (1969) carefully documented the fact that Darwin, Faraday, Freud, Newton, and many other great scientists struggled and fought to receive recognition for priority of scientific discoveries and that the drive for priority is embedded deeply in the scientific norm for originality. Merton (1957) also noted that the emphasis on originality in science has in some instances led to fraud, to fudging of data, to plagiarism, and to one scientist slandering another. However, Merton also emphasized that the strong norm for honesty in science makes such cases rather rare.
The intense competition among scientists for fame, prestige, and credit for discoveries is widely documented. For instance, a physicist at the University of California recently sued two other physicists at the same university who had been awarded the Nobel Prize for the discovery of the anti-proton. The suing physicist claimed that he had originated the seminal idea, designed an experiment to test it, and then revealed the design to the other two physicists. He also claimed that the latter two physicists did the experiment themselves, cut him out of participation, never gave him credit for his idea, and prevented him from doing anything about it for many years by threatening that, if he did, they would deny him access to important equipment (the Bevatron) which was necessary for the conduct of his research (Anonymous, 1972b).
In brief, the striving for prestige or visibility among scientists can lead to bias since some investigators seem to believe that whether or not they report significant results can make a difference in their prestige, fame, or career. An investigator may believe, for instance, that if he reports nonsignificant results, he will not be able to publish the report, he will not receive a research grant, or, if he is a doctoral candidate, he will not be granted the doctoral degree. Given this type of motivation to obtain significant results, it can be expected that some investigators may, for example, change one digit of a p value of, say, .15, to a p value of .05.
Since the hypothetical investigator discussed in the above paragraph is aware that he is violating a basic canon of scientific research (namely, to report the results correctly), he may attempt to rationalize his fudging to himself by arguing that the effect is actually there or that the results are "significant" even though they do not reach an acceptable level of significance. He may rationalize to himself that reporting a p = .05 for his results is actually more representative of his data than reporting a nonsignificant p = .15.
MOTIVATIONS FOR HONESTY
As implied above, we might expect an Investigator Fudging Effect to occur at times when there is strong motivation to obtain certain results. However, as C.P. Snow (1961) has noted, the motivation to fudge which may be present under these conditions is strongly counterbalanced by a very strong motivation to adhere to the basic canon of research by reporting the results correctly. The motivation to report the results correctly is also strong since the investigator knows that if he is caught fudging his data, he will immediately be expelled from the fraternity of scientists and, if he is even suspected of fudging, he will be treated as a pariah by his colleagues. Tullock (1966) pointed out that the strongest reason why fudging is not more common in the natural sciences is that "fakery is almost certain to be detected, and the probability of detection is highly correlated with the importance of the result reported" (p. 134). Similarly, McCain and Segal (1969) have argued that "An additional restraint is that science, being public in nature, allows checking of data by uncommitted peers" (p. 117). Although the comments by Tullock (1966) and by McCain and Segal (1969) may be valid when applied to the natural sciences, it is questionable to what extent they are also applicable to the behavioral sciences. Since experiments in the natural sciences can often be replicated and cross-validated, fakery can usually be detected. However, since experiments in the behavioral sciences are very difficult to replicate and cross-validate, fakery is much more difficult to detect. If an investigator in the behavioral sciences is unable to cross-validate an earlier study, the author of the earlier study will very likely argue that there were some important differences in the procedure which led to the failure to replicate.
The motivation to adhere to the canons of scientific research is probably sufficient to prevent falsification of data on a "grand scale" in behavioral research. It is open to question, however, whether these canons are also strong enough to prevent "small scale" fudging in which the investigator alters his data or his statistics just enough to "round off the edges," to make his results more "acceptable" (for journal publication or for his colleagues), or to fit more closely the theory to which he is committed. In brief, although the conscience of the investigator and the consequences of being caught are sufficiently strong to prevent "large scale" fudging and probably to prevent "small scale" fudging in the overwhelming majority of cases, it might also be expected that, in a few cases, the countermotivation to fudge, which derives from the investment in and the importance of obtaining certain results, finally wins out.
EXPERIMENTER EFFECTS
In the preceding chapters we discussed the pitfalls in research that are associated with the investigator. Before we now turn to the pitfalls associated with the experimenter, we should re-emphasize two points:
1. Even though the same person may be both an investigator and an experimenter, the two roles are functionally quite different. In much present-day research, investigators are highly paid professionals who design, analyze, interpret, and report studies, whereas experimenters are often graduate or undergraduate students who test the subjects while having only a peripheral involvement in the overall planning of the study.
2. One of the major contentions of this text is that the bias that has often been attributed to the lowly experimenter who runs the study is at times actually due to the high-status investigator who has major responsibility for the study. Recent books pertaining to the pitfalls or artifacts in experimental research (Adair, 1973; Friedman, 1967; Jung, 1971; A.G. Miller, 1972; Rosenthal, 1966; Rosenthal & Rosnow, 1969) have overemphasized the pitfalls associated with the experimenter and have tended to downplay the many pitfalls that are associated with the investigator. In the previous chapters we tried to correct this imbalance by pinpointing some of the many ways that investigators influence the results of their studies. Let us now turn to the role of the experimenter and note some of the ways that he may affect the results.