
aeraaddr.wp1 4/3/98

Five Methodology Errors in Educational Research:
The Pantheon of Statistical Significance and Other Faux Pas

 

Bruce Thompson

Texas A&M University 77843-4225

and

Baylor College of Medicine

____________

Invited address (Divisions E, D, and C) presented at the annual meeting (session #25.66) of the American Educational Research Association, San Diego, April 15, 1998. The assistance of Xitao Fan, Utah State University, in running the LISREL structural equation modeling program as the general linear model, is appreciated. The author may be contacted through Internet URL:
http://www.coe.tamu.edu/~bthompson.


 

ABSTRACT

After presenting a general linear model as a framework for discussion, the present paper reviews five methodology errors that occur in educational research: (a) the use of stepwise methods; (b) the failure to consider in result interpretation the context specificity of analytic weights (e.g., regression beta weights, factor pattern coefficients, discriminant function coefficients, canonical function coefficients) that are part of all parametric quantitative analyses; (c) the failure to interpret both weights and structure coefficients as part of result interpretation; (d) the failure to recognize that reliability is a characteristic of scores, and not of tests; and (e) the incorrect interpretation of statistical significance and the related failure to report and interpret the effect sizes present in all quantitative analyses. In several cases small heuristic discriminant analysis data sets are presented to make more concrete and accessible the discussion of each of these five methodology errors.


     A well-known cliche holds that a chain is only as strong as its weakest link. So, too, a research study will be at least partially compromised by whatever is the weakest link in the sequence of activities that culminate in a completed investigation. Too often the weakest link in contemporary quantitative educational research involves the methodologies of statistical analysis.

     There is no question that educational research, whatever its methodological and other limits, has influenced and informed educational practice (cf. Gage, 1985; Travers, 1983). But there seems to be some consensus that "too much of what we see in print is seriously flawed" as regards research methods, and that "much of the work in print ought not to be there" (Tuckman, 1990, p. 22). Gall, Borg and Gall (1996) concurred, noting that "the quality of published studies in education and related disciplines is, unfortunately, not high" (p. 151).

     Empirical studies of published research involving methodology experts as judges corroborate these holistic impressions. For example, Hall, Ward and Comer (1988) and Ward, Hall and Schramm (1975) found that over 40% and over 60%, respectively, of published research was judged by methods experts as being seriously or completely flawed. Wandt (1967) and Vockell and Asher (1974) reported similar results from their empirical studies of the quality of published research. Dissertations, too, have been examined, and too often have been found methodologically wanting (cf. Thompson, 1988a, 1994a).

     Of course, it must be acknowledged that even a methodologically flawed study may still contribute something to our understanding of educational phenomena. As Glass (1979) noted, "Our research literature in education is not of the highest quality, but I suspect that it is good enough on most topics" (p. 12).

     But the problem with methodologically flawed studies is that these methodological flaws are entirely gratuitous. There is no upside to conducting incorrect statistical analyses. Usually a more thoughtful analysis is not appreciably more demanding in time or expertise than is a compromised choice. Rather, incorrect analyses arise from doctoral methodology instruction that teaches research methods as a series of rotely-followed routines, as against thoughtful elements of a reflective enterprise; from doctoral curricula that seemingly have less and less room for quantitative statistics and measurement content, even while our knowledge base in these areas is burgeoning (Aiken, West, Sechrest, Reno, with Roediger, Scarr, Kazdin & Sherman, 1990; Pedhazur & Schmelkin, 1991, pp. 2-3); and, in some cases, from an unfortunate atavistic impulse to somehow escape responsibility for analytic decisions by justifying choices, sans rationale, solely on the basis that the choices are common or traditional.

Purpose of the Paper

     The purpose of the present paper is to review five methodology errors that occur in educational research: (a) the use of stepwise methods; (b) the failure to consider in result interpretation the context specificity of analytic weights (e.g., regression beta weights, factor pattern coefficients, discriminant function coefficients, canonical function coefficients) that are part of all parametric quantitative analyses; (c) the failure to interpret both weights and structure coefficients as part of result interpretation; (d) the failure to recognize that reliability is a characteristic of scores, and not of tests; and (e) the incorrect interpretation of statistical significance and the related failure to report and interpret the effect sizes present in all quantitative analyses. These comments are not new to the literature, or even to my own writing. But the field has seemingly remained somewhat recalcitrant in reflecting evolution as regards these methodological issues.

     The paper presents a conceptual overview of each concern. In several cases small heuristic data sets are presented to make more concrete and accessible the discussion of each of these five methodology errors. Because, as will be shown, all parametric methods are part of one general linear model (GLM) family, methodology dynamics illustrated for one heuristic example generalize to other related cases. In the present paper, discriminant analysis examples are consistently (but arbitrarily) employed as heuristics. Nevertheless, the illustrations necessarily generalize to other analyses within the GLM family.

Delimitation

     Of course, methodological errors other than these five might have been cited. For example, empirical studies (Emmons, Stallings & Layne, 1990) show that, "In the last 20 years, the use of multivariate statistics has become commonplace" (Grimm & Yarnold, 1995, p. vii), probably for very good reasons (Fish, 1988; Thompson, 1984, 1994e). Many such studies employ MANOVA (all to the good), but an unfortunate number of these studies then use ANOVA methods post hoc to explore detected multivariate effects (all to the bad) (Borgen & Seling, 1978). As I have noted elsewhere,

The multivariate analysis evaluates multivariate synthetic variables, while the univariate analysis only considers univariate latent variables. Thus, univariate post hoc tests do not inform the researcher about the differences in the multivariate latent variables actually analyzed in the multivariate analysis... It is illogical to first declare interest in a multivariate omnibus system of variables, and to then explore detected effects in this multivariate world by conducting non-multivariate tests! (Thompson, 1994e, p. 14, emphasis in original)

     Similarly, all too often researchers erroneously interpret the eigenvalues in factor analysis as reflecting the variance contained in the individual factors after rotation (Thompson & Daniel, 1996a). Or the discarding of variance in order to conduct ANOVA (cf. Thompson, 1985) or incorrect use of ANCOVA (Thompson, 1992b) might have been discussed. However, space precludes discussion here of all possible common methodology errors; the present discussion necessarily must be delimited in some manner.

Premise Regarding Movement in Fields

     In considering these five methodology errors, it may be important for each of us to remember that, over the course of careers, fields, including the methodology-related fields, do move. Invariably, those of us in the late stages of our careers will confront the realization that some methodology choices in our own work, published decades earlier, no longer reflect standards of present best practice, or might even now be deemed fully inappropriate. Responsible scholars must remain open, and be willing to engage in continual reflection as to whether our own personal analytic traditions remain viable.

     Some have suggested that resistance to adopting revised methodological practice may in some cases be an artifact of denial, cognitive dissonance, and other classical psychological dynamics (Thompson, in press-d). For example, Schmidt and Hunter (1997) noted that "changing the beliefs and practices of a lifetime... naturally... provokes resistance" (p. 49). Similarly, Rozeboom (1960) observed that "the perceptual defenses of psychologists are particularly efficient when dealing with matters of methodology, and so the statistical folkways of a more primitive past continue to dominate the local scene" (p. 417).

     Recognizing the reality that fields move, and that to be fair works must be evaluated primarily against the methodological standards contemporary at the time of a given report, may facilitate helpful change. Prior to advocating selected changes, however, the general linear model (GLM) will be briefly described so as to provide a unifying conceptual framework for the remaining discussion. Structural equation modeling (SEM) will be presented as the most general case of the GLM.

Conceptual Framework: SEM as the General Linear Model (GLM)

     In one of his innumerable seminal contributions, the late Jacob "Jack" Cohen (1968) demonstrated that multiple regression subsumes all the univariate parametric methods as special cases, and thus provides a univariate general linear model that can be employed in all univariate analyses. Ten years later, in an equally important article Knapp (1978) presented the mathematical theory showing that canonical correlation analysis subsumes all the parametric analyses, both univariate and multivariate, as special cases. More concrete demonstrations of these relationships have also been offered (Fan, 1996; Thompson, 1984, 1991, in press-a). Both the Cohen (1968) and the Knapp (1978) articles were cited within a compilation of the most noteworthy methodology articles published during the last 50 years (Thompson & Daniel, 1996b).

     However, structural equation modeling (SEM) represents an even bigger conceptual tent subsuming more restrictive methods (Bagozzi, 1981). Instructive illustrations of these relationships have been offered by Fan (1997). Prior to extracting the conceptual implications of the realization that a general linear model underlies all parametric analyses, a concrete demonstration that SEM is a general linear model subsuming canonical correlation analysis (CCA) (and its multivariate and univariate special cases) may be useful.

Heuristic Illustration that SEM Subsumes CCA

     The illustration that SEM is a general linear model subsuming canonical correlation analysis (and its multivariate and univariate special cases) employs scores on five variables (i.e., two in one set, and three in the other set) from the 301 cases in the Holzinger and Swineford (1939, pp. 81-91) data. These scores on ability batteries have classically been used as examples in both popular textbooks (Gorsuch, 1983, passim) and computer program manuals (Jöreskog & Sörbom, 1989, pp. 97-104), and thus are familiar to many readers.

     Table 1 presents the bivariate correlation matrix for these data. As in all parametric analyses, a correlation or covariance matrix is the basis for the computations; this matrix is partitioned into quadrants (see Table 1) honoring the variables' membership in criterion or predictor sets, and is then subjected to a principal components analysis (Thompson, 1984, in press-a).

__________________________

INSERT TABLE 1 ABOUT HERE.

__________________________

     Appendix A presents the SPSS/LISREL computer program used to analyze the data. Table 2 presents the SPSS canonical correlation analysis of these same data.

__________________________

INSERT TABLE 2 ABOUT HERE.

__________________________

     Table 3 presents the relevant portions of the LISREL analysis of the canonical correlation model for these data. The LISREL coefficients for the "gamma" matrix exactly match (within rounding error) the SPSS canonical function coefficients presented in Table 2. The only exception is that all the signs for the SEM second canonical function coefficients must be "reflected." "Reflecting" a function (changing all the signs on a given function, factor, or equation) is always permissible, because the scaling of psychological constructs is arbitrary. Thus, the SEM and the canonical analysis derived the same results. Since SEM can be employed to test a CCA model, SEM is an even more general case of the general linear model, quod erat demonstrandum.

__________________________

INSERT TABLE 3 ABOUT HERE.

__________________________

Heuristic Implications

     There are a number of implications that can be drawn from the realization that a general linear model subsumes other methods as special cases. Specifically, all classical parametric methods are least squares procedures that implicitly or explicitly (a) use least squares weights (e.g., regression beta weights, standardized canonical function coefficients) to optimize explained variance and minimize model error variance, (b) focus on latent synthetic variables (e.g., the regression Ŷ variable) created by applying the weights (e.g., beta weights) to scores on measured/observed variables (e.g., regression predictor variables), and (c) yield variance-accounted-for effect sizes analogous to r² (e.g., R², eta², omega²). Thus, all classical analytic methods are correlational (Knapp, 1978; Thompson, 1988a).

     Designs may be experimental or correlational, but all analyses are correlational. Thus, an effect size analogous to r² can be computed in any parametric analysis (see Snyder & Lawson, 1993, or Kirk, 1996).
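     The claim that even a group-comparison analysis is correlational can be checked numerically. The following minimal sketch uses invented scores for two hypothetical groups (the data and function names are assumptions made only for illustration) and shows that the ANOVA eta² equals the r² between a dummy-coded group variable and the outcome scores.

```python
# Hypothetical scores for two groups; the numbers are illustrative only.
group_a = [12.0, 15.0, 11.0, 14.0, 13.0]
group_b = [17.0, 16.0, 19.0, 15.0, 18.0]

def mean(values):
    return sum(values) / len(values)

def eta_squared(a, b):
    """ANOVA effect size for two groups: SS_between / SS_total."""
    scores = a + b
    grand = mean(scores)
    ss_total = sum((y - grand) ** 2 for y in scores)
    ss_between = (len(a) * (mean(a) - grand) ** 2
                  + len(b) * (mean(b) - grand) ** 2)
    return ss_between / ss_total

def r_squared(x, y):
    """Squared Pearson correlation between x and y."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy ** 2 / (sxx * syy)

# Dummy-code group membership (0 = group A, 1 = group B)
# and correlate the code with the outcome scores.
dummy = [0.0] * len(group_a) + [1.0] * len(group_b)
scores = group_a + group_b

eta2 = eta_squared(group_a, group_b)
r2 = r_squared(dummy, scores)
print(round(eta2, 6), round(r2, 6))
```

     The two printed values agree because, with two groups, the ANOVA partition of variance and the bivariate correlation with a dummy code are the same least squares problem.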

     The fact that all classical parametric methods use weights to then compute synthetic/latent variables by applying the weights to the measured/observed variables is obscured by the fact that most computer packages do not print the least squares weights that are actually invoked in ANOVA, for example, or when t-tests are conducted. Thus, some researchers unconsciously presume that such methods do not invoke optimal weighting systems.

     The fact that all classical parametric methods use weights to then compute synthetic/latent variables by applying the weights to the measured/observed variables is also obscured by the inherently confusing language of statistics. As I have noted elsewhere, the weights in different analyses

...are all analogous, but are given different names in different analyses (e.g., beta weights in regression, pattern coefficients in factor analysis, discriminant function coefficients in discriminant analysis, and canonical function coefficients in canonical correlation analysis), mainly to obfuscate the commonalities of [all] parametric methods, and to confuse graduate students. (Thompson, 1992a, pp. 906-907)

If all standardized weights across analytic methods were called by the same name (e.g., beta weights), then researchers might (correctly) conclude that all analyses are part of the same general linear model.

     Indeed, both the weight systems (e.g., regression equation, factor) and the synthetic variables (e.g., the regression Ŷ variable) are also arbitrarily given different names across the analyses, again mainly so as to confuse the graduate students. Table 4 summarizes some of the elements of the very effective conspiracy.

__________________________

INSERT TABLE 4 ABOUT HERE.

__________________________

     The present paper will employ this general linear model as a unifying conceptual framework for some of the arguments made herein. However, prior to presenting these views, a brief digression is required.

Predictive Discriminant Analysis (PDA) as a Hybrid GLM Offshoot

     In the seminal work on discriminant analysis, Huberty (1994; see also Huberty and Barton (1989) and Huberty and Wisenbaker (1992)) thoughtfully distinguished two major applications: descriptive discriminant analysis (DDA) and predictive discriminant analysis (PDA). Put simply, DDA describes the differences on intervally-scaled "response" variables associated with a nominally-scaled variable, membership in different groups. PDA, on the other hand, uses intervally-scaled "response" variables to predict membership in different groups. Thus, the purpose of the analysis distinguishes the two methods (and these purposes subsequently determine which aspects of the results are relevant or irrelevant).

     The drawing of a distinction between DDA and PDA is not mere statistical nit-picking. Instead, the relevant aspects of DDA and PDA results are completely different. For example, in PDA the "hit rate" (and which response variables most contribute to the hit rate) is the sine qua non of the analysis, while the weights are generally irrelevant as regards result interpretation. In DDA, on the other hand, the weights and the "structure" of the synthetic/latent variable scores are very important to interpretation, but the concept of hit rate becomes irrelevant.

     The number of systems of weights (i.e., "functions," or "rules") also differs across DDA and PDA. In DDA, the number of linear discriminant functions (LDFs) is the number of groups minus one, or the number of response variables, whichever is smaller. In PDA, the number of linear classification functions (LCFs) is the number of groups. For example, with two groups and three response variables, in DDA there would be one LDF (and an associated set of scores on the synthetic variable, the discriminant scores). In the same case, in PDA there would be two LCFs (and associated sets of scores on the synthetic variables, the classification scores).
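     The two counting rules just stated can be written down directly. This is only a restatement of the rules in code; the function names are hypothetical.

```python
def n_discriminant_functions(n_groups, n_response_vars):
    """DDA: the number of LDFs is the smaller of (groups - 1) and the
    number of response variables."""
    return min(n_groups - 1, n_response_vars)

def n_classification_functions(n_groups):
    """PDA: there is one LCF per group."""
    return n_groups

# The two-group, three-response-variable case described in the text.
print(n_discriminant_functions(2, 3))
print(n_classification_functions(2))
```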

     PDA is a hybrid offshoot of the general linear model, while DDA resides fully within the GLM nuclear family. Thus, the conclusions reached here based on GLM concepts may not apply to the PDA case.

When More Variables Can Hurt Study Effects

     One powerful demonstration of PDA versus DDA dynamics involves a paradox. In any GLM analysis, more variables (e.g., more regression predictors) always lead to effect sizes (e.g., R²) that are equal to or greater than the effects associated with fewer variables. However, in PDA, more response variables can actually hurt the PDA hit rate.

     The Table 5 data, drawn from the Holzinger and Swineford (1939) data described previously, can be analyzed to illustrate these dynamics. The Appendix B SPSS program conducts the relevant analyses.

__________________________

INSERT TABLE 5 ABOUT HERE.

__________________________

     Table 6 presents the hit rates derived using three response variables as predictors using both LDF and LCF scores; these hit rates are both 66.4% ([40 + 31] / 107). [Normally only LCFs are used for classification purposes, even though SPSS incorrectly uses LDF scores for this purpose (Huberty & Lowman, 1997).] Table 6 also presents the hit rates derived using four response variables as predictors using both LDF and LCF scores; these hit rates are both 63.6% ([38 + 30] / 107). Figure 1 presents the corresponding results in graphic form.

_______________________________________

INSERT TABLE 6 AND FIGURE 1 ABOUT HERE.

_______________________________________

     Indeed, the hit rate difference with the use of three versus four response variables is even greater than the apparent difference of 71 versus 68 people, respectively, being correctly classified. In fact, as noted in Table 7, 9 persons were classified differently across the analyses using three versus four response variables, even though the net impact of using more predictors was a net loss in predictive accuracy of three hits. [If the same data were treated as reflecting a DDA case, the Wilks lambda effect size would be the same or better (i.e., a smaller lambda value) for four (0.8050684) as against three (0.8094909) response variables, as is always true in the GLM case.]
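     The arithmetic behind these hit rates can be laid out explicitly. The sketch below uses only the counts reported in the text. The split of the 9 reclassified persons into hits gained and hits lost is my inference from those counts, assuming (as the two-term numerators suggest) that there are two groups, so that any change in classification flips a case between "hit" and "miss".

```python
n_cases = 107

hits_three = 40 + 31   # correct classifications with three response variables
hits_four = 38 + 30    # correct classifications with four response variables

rate_three = hits_three / n_cases
rate_four = hits_four / n_cases
net_change = hits_four - hits_three

# Assumption: two groups, so every reclassified case flips hit <-> miss.
# Then gained + lost = 9 and gained - lost = net_change.
changed = 9
gained = (changed + net_change) // 2
lost = (changed - net_change) // 2

print(f"{rate_three:.1%} vs {rate_four:.1%}; "
      f"net change {net_change}, gained {gained}, lost {lost}")
```

     Under that assumption, the net loss of three hits conceals more movement than it shows: some misses became hits while a larger number of hits became misses.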

__________________________

INSERT TABLE 7 ABOUT HERE.

__________________________

     Elsewhere I (Thompson, 1995b) have explained some of these counterintuitive dynamics by portraying a hypothetical set of results involving five response variables. Presume there were three "fence-riders," that is, cases very near the classification boundaries (arbitrarily cases #4, #11, and #51). Let's say with five predictor variables our initial lambda is .50, and let's say we add an additional, sixth response variable as a PDA predictor.

     Clearly, having more predictive information always help us better explain data dynamics, or at least can't take away what we already know. This is reflected by the fact that the Wilks lambda value will always stay the same or get better (i.e., smaller) as we add predictor variables.

     But this occurs only on the average, as reflected in on-the-average statistics such as lambda. While relative explanatory power will remain the same or improve on the average, at the case level each and every single case will not necessarily move toward its actual group's location when the additional sixth predictor variable is used. For example, let's say that all cases' positions except cases #4, #11, #51 and #43 remain fixed in essentially their initial locations and that group territorial boundaries also remain roughly unchanged.
      If because the sixth predictor was especially useful in locating case #43, case #43 might move very far toward but not over the boundary that would have yielded a correct classification. Lambda would reflect this change by getting better (i.e., smaller), such as changing from .50 to perhaps .45. Cases #4, #11, and #51 might move slightly away from their actual group, because although the sixth predictor will either not change explanatory power or will provide more information on the average, it is still possible that the sixth predictor may provide misinformation about these three particular cases, resulting in their moving across their actual group boundary and becoming misclassified. This small movement will, of course, be reflected in lambda, which will correspondingly get only slightly worse (i.e., bigger), such as moving from .45 to .46. Yet even though on the average locations have gotten more accurate and lambda has consequently improved from the original .50 to the final .46, the number of cases correctly classified when using all six predictors will have gotten worse by a net classification-accuracy change of minus three cases. (Thompson, 1995b, p. 345, emphasis in original)

Error #1: Using Stepwise Methods

     Huberty (1994) has noted that, "It is quite common to find the use of 'stepwise analyses' reported in empirically based journal articles" (p. 261). Huebner (1991, 1992) and Jorgenson, Jorgenson, Gillis and McCall (1993) are a few examples from among the many egregious reports of stepwise analyses.

     Stepwise methods continue to be used, notwithstanding scathing indictments of many of these applications (cf. Huberty, 1989; Snyder, 1991). My own feelings are intimated by the title of one of my editorials, viz. "Why won't stepwise methods die?" (Thompson, 1989).

     Three major problems with stepwise methods can be noted, and will be briefly summarized here. A more complete treatment is available in Thompson (1995c).

     The consequences of these three problems are quite serious. As Cliff (1987, p. 185) noted, "most computer programs for [stepwise] multiple regression are positively satanic in their temptations toward Type I errors." He also suggested that, "a large proportion of the published results using this method probably present conclusions that are not supported by the data" (pp. 120-121).

Wrong Degrees of Freedom

     First, most computer packages (and thus most researchers) use the wrong degrees of freedom in their statistical significance tests for stepwise methods, thus systematically inflating the likelihood of obtaining statistically significant results. Degrees of freedom are the "coins" we pay to investigate the dynamics within our data. The statistical significance tests take into account both the number of coins we've chosen to spend and the number we have chosen to reserve.

     The most rigorous tests occur when we spend few degrees of freedom and reserve many. Conversely, at the extreme, all models with no degrees of freedom reserved (i.e., degrees of freedom error = 0) always fit the data perfectly. For example, the bivariate r² with n=2 inherently is always 1.0, as long as both X and Y are variables. Similarly, the multiple regression R² with two predictor variables and n=3 inherently must always be 1.0.
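     The n=2 case is easy to verify by brute force. The following sketch draws arbitrary pairs of cases (the random scores and the seed are assumptions made purely for illustration) and confirms that the squared bivariate correlation is 1.0 every time, because two points always lie exactly on a line.

```python
import random

def r_squared(x, y):
    """Squared Pearson correlation between x and y."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)

random.seed(0)  # arbitrary seed so that the run is repeatable
results = []
for _ in range(1000):
    # sample() returns two distinct values, so both X and Y are
    # variables (not constants), as the text requires.
    x = random.sample(range(100), 2)
    y = random.sample(range(100), 2)
    results.append(r_squared(x, y))

print(min(results), max(results))
```

     No degrees of freedom are reserved here (n - 1 - 1 = 0), so the perfect fit says nothing about the relationship between X and Y.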

     The computer packages conventionally charge degrees of freedom for the numerator (synonymously also called "model," "between," "regression," and "explained," to confuse the graduate students) that are a function of the number of response variables "entered" in the analysis at a given step. The remaining degrees of freedom (synonymously called "denominator," "residual," "error," "within," and "unexplained") are inversely related to the number of response variables "entered" in a given step.

     Table 8 illustrates these dynamics for a study involving two steps of stepwise analysis of 50 response variables, with k=3 groups and n=120 people. Table 8 compares the results for two steps of analysis using the degrees of freedom calculations employed by SPSS and other computer packages, labelled "Incorrect," with the same calculations employing the correct degrees of freedom.

__________________________

INSERT TABLE 8 ABOUT HERE.

__________________________

     The difference in the analyses revolves around what "entered" means. The computer packages define "entered" or "used" as actually entered into the prediction equation. Thus, in step one the packages consider that only one predictor has been entered, while in step two the packages consider that two response variables have been entered.

     However, in this example each and every one of the 50 response variables was "used" at each and every one of the two steps, to decide which variable to enter at each step. The 49 or 48 unselected response variables may not have been retained in the analysis, but each one was examined, and played with, and actually tasted, prior to the leftovers then being returned to the cafeteria display case.

     This system of determining the degrees of freedom bill is analogous to only charging John Belushi in the movie Animal House for the food on his cafeteria tray, and charging nothing for what he has tasted and discarded. Clearly, this statistical package system of coinage is wrong. [Charging only for variables actually entered at each step would be appropriate, for example, if these response variables were randomly selected without first tasting each and every response variable.]

     It is instructive to see how using the wrong degrees of freedom in the numerator of the statistical significance testing calculations, and the wrong denominator df in the calculations, both bias the tests in favor of getting statistical significance. Table 8 illustrates how dramatic the effect of using the wrong degrees of freedom can be.

     After one step, the computer calculates that F(2,117) = 15.29841, with an associated probability of .0000012; the correct F(100,136) is 0.16751, with an associated probability of 1.00000. After the second step, the computer calculates that F(4,232) = 13.64322, with an associated probability of .0000945; the correct F(100,136) is 0.31991, with an associated probability of 1.00000. Obviously, the example illustrates that the correct and incorrect results can be night-vs-day different!
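     The reported F values can be reproduced from the Wilks lambda values. The sketch below assumes the standard exact F transformation of lambda for the three-group case, F = [(1 - sqrt(lambda)) / sqrt(lambda)] x [(n - p - 2) / p], with 2p and 2(n - p - 2) degrees of freedom, where p is the number of response variables charged to the analysis. It also assumes that the step-two lambda is the value (.6553991) reported for the two-variable stepwise solution in Table 10, and it recovers the step-one lambda from the reported F(2,117). The function name is hypothetical.

```python
from math import sqrt

def f_from_lambda(wilks_lambda, n, p):
    """Exact F for Wilks lambda when there are three groups (k - 1 = 2).

    p is the number of response variables charged to the analysis.
    Returns (F, numerator df, denominator df).
    """
    root = sqrt(wilks_lambda)
    df_num = 2 * p
    df_den = 2 * (n - p - 2)
    f_value = ((1 - root) / root) * (df_den / df_num)
    return f_value, df_num, df_den

n, k = 120, 3        # sample size and number of groups, from the text
n_candidates = 50    # response variables examined at every step

# Step two: charge two variables (package) versus all 50 (correct).
lambda_step2 = 0.6553991
f_package_2 = f_from_lambda(lambda_step2, n, p=2)
f_correct_2 = f_from_lambda(lambda_step2, n, p=n_candidates)

# Step one: recover lambda from the reported F(2,117) = 15.29841 using
# the one-variable identity F = ((1 - lambda) / lambda) * ((n - k) / (k - 1)).
f_step1 = 15.29841
lambda_step1 = 1 / (1 + f_step1 * (k - 1) / (n - k))
f_correct_1 = f_from_lambda(lambda_step1, n, p=n_candidates)

print(f_package_2)   # package calculation, step two
print(f_correct_2)   # all 50 variables charged, step two
print(f_correct_1)   # all 50 variables charged, step one
```

     The effect size (lambda) is identical in the package and the corrected calculations; only the degrees of freedom bill changes, and that alone moves the F ratio from well above 1 to well below 1.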

     Three factors determine exactly how egregiously the use of the wrong degrees of freedom distorts the stepwise results. The distortions are increasingly serious as (a) sample size is smaller, (b) the number of steps is larger, and (c) the number of response variables available to be selected is larger.

Nonreplicability of Results

     Second, stepwise methods tend to yield results that are sample-specific and do not generalize well to future studies. This is because stepwise requires a linear sequence of decisions, each of which is contingent upon all the previous decisions in the sequence. This is very much like walking through a maze--an incorrect decision at any point will lead to a cascade of subsequent decisions that each may themselves be wrong.

     Stepwise considers all differences of any magnitude between variance explained by the response variables to be exact and true. Since there are usually numerous combinations of the response variables, and credit for variance explained for each partition of the variables may be influenced by sampling error, any small amount of sampling error anywhere in a single response variable can lead to disastrously erroneous choices in the linear sequence of stepwise selection decisions.

Stepwise Does NOT Identify the Best Variable Set of a Given Size

     Third, stepwise methods do not correctly identify the best set of predictors of a given response variable set size, k. For example, if one has 30 response variables, and does three steps of analysis, it is possible that the best predictor set of size k=3 will include none of the three variables selected after three steps of stepwise analysis of the same data, and that the three stepwise variables would also yield a lower effect size.

     This may seem counter-intuitive, but upon reflection, it should be easy to see that in fact stepwise analysis does not seek to identify the best variable set of a certain size. Stepwise simply does not ask the question, "What is the best predictor set of a given size?" This question requires simultaneously considering all the combinations of the variables that are possible for a given set size. Stepwise analysis never simultaneously considers all the combinations of the predictor variables. Rather, at each step stepwise analysis takes the previously entered variables as a given, and then asks which one change in the predictor set will most improve the prediction.

     Picking the best new variable in a sequence of selections is not the same as picking the best variable set of a given size. As Thompson (1995c) explained:

     Suppose one was picking a basketball team consisting of five players. The stepwise selection strategy picks the best potential player first, then the second best player in the context of the characteristics of the previously-selected first player, and so forth.
     An alternative strategy is an all-possible-subsets approach which asks, "which five potential players play together best as a team?". This team might conceivably contain exactly zero of the five players selected through the stepwise approach. Furthermore, this "best team" might be able to stomp the "stepwise team" by a considerable margin, because teams consisting of players of lesser abilities may still play together better as a team than players selected through a linear sequence of stepwise decisions. (pp. 528, 530, emphasis in original)
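     The difference between the two selection strategies can be sketched in a few lines. The effect sizes below are invented for illustration and do not come from any table in this paper; the variable labels and function names are likewise hypothetical. The "stepwise" routine implements only forward selection, the core of the stepwise logic, and omits the removal tests that some packages add.

```python
from itertools import combinations

# Hypothetical effect sizes (e.g., R-squared) for every subset of up to
# two of four predictors. Invented values, for illustration only.
effect = {
    frozenset("A"): 0.30, frozenset("B"): 0.25,
    frozenset("C"): 0.10, frozenset("D"): 0.12,
    frozenset("AB"): 0.38, frozenset("AC"): 0.35, frozenset("AD"): 0.36,
    frozenset("BC"): 0.30, frozenset("BD"): 0.31, frozenset("CD"): 0.60,
}
variables = "ABCD"

def forward_stepwise(size):
    """Add, one at a time, the variable that most improves the effect,
    taking all previously entered variables as a given."""
    chosen = frozenset()
    for _ in range(size):
        candidates = [v for v in variables if v not in chosen]
        best = max(candidates, key=lambda v: effect[chosen | {v}])
        chosen = chosen | {best}
    return chosen

def best_subset(size):
    """Compare all possible subsets of the requested size at once."""
    subsets = (frozenset(c) for c in combinations(variables, size))
    return max(subsets, key=lambda s: effect[s])

stepwise = forward_stepwise(2)
best = best_subset(2)
print(sorted(stepwise), effect[stepwise])
print(sorted(best), effect[best])
```

     With these invented values the forward sequence enters A and then B, while the best pair is C and D, which shares no variable with the stepwise pair and yields a larger effect.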

     The Table 9 data provide a powerful heuristic. Table 10 presents an abridged printout for these data involving two steps of stepwise DDA, conducted using the Appendix C SPSS program. In this analysis the stepwise algorithm selects response variables X1 and X2, and the lambda value is .6553991 (F(4,232)=13.64322).

__________________________________

INSERT TABLES 9 AND 10 ABOUT HERE.

__________________________________

     Compare the Table 10 results with those in Table 11. Table 11 presents the DDA results for all six possible combinations of the four response variables considered two at a time. Note that the best set of two variables (i.e., smallest lambda) involves response variables X3 and X4 (lambda = .6272538, F(4,232)=15.23292). The best variable set of size two contained neither of the two variables selected by the stepwise analysis!!!!!

___________________________

INSERT TABLE 11 ABOUT HERE.

___________________________
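
     The contrast between the two selection strategies can also be sketched computationally. The Python fragment below is a minimal illustration under stated assumptions, not the Appendix C SPSS program: the data are randomly generated (they are not the Table 9 data), the function names are hypothetical, and the stepwise routine is a simplified forward-only search that minimizes Wilks' lambda without the entry and removal criteria a statistical package would apply. Because the all-possible-subsets search examines every pair, its lambda can never be larger than the lambda for the stepwise pair; whether the two pairs actually differ depends on the data.

```python
from itertools import combinations
import random

def det(m):
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in m]
    n, d = len(a), 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(a[r][c]))
        if abs(a[p][c]) < 1e-12:
            return 0.0
        if p != c:
            a[c], a[p] = a[p], a[c]
            d = -d
        d *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= f * a[c][k]
    return d

def sscp(rows, cols):
    """Deviation sums of squares and cross-products for the chosen columns."""
    means = [sum(r[c] for r in rows) / len(rows) for c in cols]
    return [[sum((r[ci] - mi) * (r[cj] - mj) for r in rows)
             for cj, mj in zip(cols, means)]
            for ci, mi in zip(cols, means)]

def wilks_lambda(groups, cols):
    """Wilks' lambda = |W| / |T| (pooled within-groups SSCP over total SSCP)."""
    cols = list(cols)
    within = [sscp(g, cols) for g in groups]
    W = [[sum(s[i][j] for s in within) for j in range(len(cols))]
         for i in range(len(cols))]
    T = sscp([row for g in groups for row in g], cols)
    return det(W) / det(T)

def forward_stepwise(groups, n_vars, size):
    """Greedy: keep earlier picks fixed, add the variable that most lowers lambda."""
    chosen = []
    while len(chosen) < size:
        rest = [v for v in range(n_vars) if v not in chosen]
        chosen.append(min(rest, key=lambda v: wilks_lambda(groups, chosen + [v])))
    return sorted(chosen)

def best_subset(groups, n_vars, size):
    """Exhaustive: evaluate every variable set of the requested size."""
    return list(min(combinations(range(n_vars), size),
                    key=lambda s: wilks_lambda(groups, s)))

# Hypothetical data: 3 groups x 20 cases x 4 response variables (not Table 9).
random.seed(1)
shifts = [(0, 0, 0, 0), (1, 0, 1, 1), (0, 1, 1, -1)]
groups = [[[random.gauss(m, 1) for m in shift] for _ in range(20)]
          for shift in shifts]

step = forward_stepwise(groups, 4, 2)
best = best_subset(groups, 4, 2)
print("stepwise pair:", step, round(wilks_lambda(groups, step), 4))
print("best pair:    ", best, round(wilks_lambda(groups, best), 4))
```

     The exhaustive search costs more (six pairs here, but the count grows combinatorially with the number of variables), which is the price of actually asking the question, "What is the best predictor set of a given size?"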

 

Error #2: Ignoring the Context Specificity of GLM Weights

     As noted previously, all univariate and multivariate methods apply weights to the measured variables to derive scores on the latent or synthetic variables that are actually the focus of all analyses. Consequently, if (and only if) noteworthy effects (e.g., R², Rc²) are detected, it then becomes reasonable to consult the weights as part of the process of determining which response variables contributed to the detected effect. Indeed, some researchers have even taken the view that these weights (e.g., beta weights, standardized discriminant function coefficients) should be the sole basis for evaluating the importance of response variables (Harris, 1989).

     Unfortunately, overinterpretation of GLM weights is a serious threat. The weights can be greatly influenced by which variables are included in or excluded from a given analysis. Furthermore, Cliff (1987, pp. 177-178) noted that weights for a given set of variables may vary widely across samples, and yet consistently still yield the same effect sizes (i.e., be what he called statistically "sensitive"). Clearly weights are not the sole story in interpretation.

     Any interpretations of weights must be considered context-specific. Any change in the variables in the model can radically alter all of the weights. Too few researchers appreciate the potential magnitudes of these impacts.

     The Table 12 data illustrate these dynamics. The analysis contrasts using DDA models with either three response variables (i.e., X1, X2, and X3) or four response variables (i.e., X1, X2, X3, and X4). The example can be framed as either adding one response variable to an analysis involving three response variables, or deleting one response variable from an analysis involving four. This DDA example involves variance-covariance matrices for each of three groups that are exactly equal (called "homogeneity"), so the results are not confounded by failure to meet one of the assumptions of the analysis.

___________________________

INSERT TABLE 12 ABOUT HERE.

___________________________

     Table 13 presents an excerpt from an SPSS analysis of the Table 12 data conducted using the Appendix D computer program. Note the dramatic changes in the DDA standardized function coefficients. For example, with three response variables the first response variable, X1, had standardized function coefficients of 1.50086 and -.01817 on the two DDA functions. With four response variables X1 had standardized function coefficients of -.47343 and 1.22249 on the two DDA functions. Thus, the coefficients were quite variable in both magnitude and sign.

___________________________

INSERT TABLE 13 ABOUT HERE.

___________________________

     These fluctuations are not problematic, if (and only if) the researcher has selected exactly the right model (i.e., has not made what statisticians call a model specification error). But as Pedhazur (1982) has noted, "The rub, however, is that the true model is seldom, if ever, known" (p. 229). And as Duncan (1975) has noted, "Indeed it would require no elaborate sophistry to show that we will never have the 'right' model in any absolute sense" (p. 101).

     In other words, as a practical matter, the context-specificity of weights is always problematic, and the weights consequently must be interpreted cautiously. Some researchers acknowledge the vulnerability of the weights to sampling error influences (i.e., the so-called "bouncing beta" problem), but a more obvious concern is the context-specificity of the weights in the real-world context of full or partial model misspecification.
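
     The same dynamic is easy to reproduce in the regression case of the general linear model. The sketch below is only an analog of the DDA example, under stated assumptions: the three correlations are hypothetical values chosen for illustration (they are not derived from the Table 12 data), and the helper names are invented here. Standardized weights are obtained by solving the normal equations, R_xx * beta = r_xy, first with X1 as the only predictor and then with X2 added.

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(n):
            if r != c:
                f = m[r][c] / m[c][c]
                m[r] = [x - f * y for x, y in zip(m[r], m[c])]
    return [m[i][n] / m[i][i] for i in range(n)]

def beta_weights(r_xx, r_xy):
    """Standardized regression weights: solve R_xx * beta = r_xy."""
    return solve(r_xx, r_xy)

# Hypothetical correlations (assumed for illustration only).
r_y1, r_y2, r_12 = 0.30, 0.60, 0.80

one_predictor = beta_weights([[1.0]], [r_y1])
two_predictors = beta_weights([[1.0, r_12], [r_12, 1.0]], [r_y1, r_y2])

print("X1 alone:   beta1 =", round(one_predictor[0], 3))
print("X1 with X2: beta1 =", round(two_predictors[0], 3),
      " beta2 =", round(two_predictors[1], 3))
```

     With these assumed values the weight for X1 moves from +.30 to -.50 merely because X2 entered the model, even though the correlation of X1 with the criterion has not changed at all.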

Error #3: Failing to Interpret

Both Weights and Structure Coefficients

     A response variable given a standardized weight of zero is being obliterated by the multiplicative weighting process, indicating either that (a) the variable has zero capacity to explain relationships among the variables or that (b) the variable has some explanatory capacity, but one or more other variables yield the same explanatory information and are arbitrarily (not wrongly, just arbitrarily) receiving all the credit for the variable's predictive power. Because a zero weight cannot by itself distinguish case (a) from case (b), it is essential to evaluate other coefficients in addition to standardized weights during interpretation, to determine the specific basis for the weighting.

     Just as it would be incorrect to evaluate predictor variables in a regression analysis only by consulting beta weights (Cooley & Lohnes, 1971, p. 55; Thompson & Borrello, 1985), in any GLM analysis it would be inappropriate to only consult standardized weights during result interpretation (Borgen & Seling, 1978, p. 692; Kerlinger & Pedhazur, 1973, p. 344; Levine, 1977, p. 20; Meredith, 1964, p. 55; Thompson, 1997b). Yet, some researchers do exactly that (cf. Humphries-Wadsworth, 1998).

     Under most circumstances standardized weights are not correlation coefficients. Thus, some of the weights in Table 11 are less than -1 or are greater than +1. Structure coefficients, on the other hand, are always correlation coefficients, and reflect the linear relationship between scores on a given measured or observed variable with the scores on a given latent or synthetic variable. Thus, because synthetic variables are actually the focus of all parametric analyses, and because structure coefficients reveal the structure of these latent variables, the importance of structure coefficients seems obvious.

     Three possible cases can be delineated. The three illustrations demonstrate that jointly considering both standardized weights and structure coefficients indicates to the researcher which case is present in a given analysis. Appendix E presents the SPSS computer program used to analyze the three heuristic data sets.

Case #1: Function and Structure Coefficients are Equal

     In the special GLM case where measured variables are uncorrelated, the standardized weights (in this case only) are correlation coefficients. For example, in regression, if the predictor variables are uncorrelated, each predictor variable's beta weight equals that variable's product-moment correlation with the criterion variable. In discriminant analysis, the same principle applies if the "pooled" correlation matrix of the response variables indicates that the response variables are uncorrelated.
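
     For the regression case the reason can be stated in one line. The notation is introduced here only for illustration and does not appear elsewhere in the paper: R_xx is the matrix of correlations among the predictors, r_xy is the vector of predictor-criterion correlations, and beta is the vector of standardized weights.

```latex
\boldsymbol{\beta} = \mathbf{R}_{xx}^{-1}\,\mathbf{r}_{xy},
\qquad
\mathbf{R}_{xx} = \mathbf{I}
\;\Longrightarrow\;
\boldsymbol{\beta} = \mathbf{r}_{xy}
```

     When the predictors are correlated, R_xx is no longer an identity matrix, and each weight then depends on every predictor in the model rather than on a single bivariate correlation.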

     Table 14 presents a hypothetical DDA data set illustrating this case for a k=3 group problem involving scores of n=30 people on each of p=3 response variables. As indicated by the Table 15 excerpt from the SPSS output for these data, in this special case the standardized function coefficients exactly equal the respective structure coefficients of the response variables.

___________________________________

INSERT TABLES 14 AND 15 ABOUT HERE.

___________________________________

 

Case #2: Measured Variables with Near-zero Weights Still Important

     As noted previously, measured variables may be assigned multiplicative weights of zero if the measured variable contains useful variance, but that variance is also present in some combination of the other measured variables. The researcher interpreting these results, especially if only standardized weights are interpreted, might erroneously conclude that such a response variable with a near-zero weight had essentially no utility in generating the observed effect. Instead, the result merely indicates that this variable is arbitrarily being denied credit for its potential contributions.

     Table 16 presents a relevant heuristic DDA data set for this case involving k=3 groups and p=3 response variables. Table 17 presents an excerpt from the related SPSS analysis of the tabled data.

___________________________________

INSERT TABLES 16 AND 17 ABOUT HERE.

___________________________________

     In this example, the standardized function coefficient on Function I for X3 was -.05507, while on the same function the other two response variables had standardized function coefficients of roughly +.95. Yet the squared structure coefficient (rS² = .81431² = 66.3%) for X3 on the function indicates that X3 had more than twice the explanatory power of variables X1 (rS² = .54141² = 29.3%) and X2 (rS² = .56453² = 31.9%). Clearly, consulting only the function coefficients for this example would have resulted in a serious misinterpretation of results.
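
     A two-predictor regression analog makes the same point with less machinery. The sketch below rests on stated assumptions: the three correlations are hypothetical (they are not the Table 16 data), the function names are invented for illustration, and the structure coefficient is computed as the correlation between a predictor and the predicted scores, which in regression equals r_yx divided by the multiple R.

```python
from math import sqrt

def two_predictor_betas(r_y1, r_y2, r_12):
    """Standardized weights for a regression with two predictors."""
    denom = 1.0 - r_12 ** 2
    return (r_y1 - r_y2 * r_12) / denom, (r_y2 - r_y1 * r_12) / denom

def structure_coefficients(r_y1, r_y2, r_12):
    """Correlation of each predictor with the predicted scores (r_yx / R)."""
    b1, b2 = two_predictor_betas(r_y1, r_y2, r_12)
    multiple_r = sqrt(b1 * r_y1 + b2 * r_y2)  # R^2 = sum of beta * r_yx
    return r_y1 / multiple_r, r_y2 / multiple_r

# Hypothetical correlations, chosen so that X2 overlaps heavily with X1.
r_y1, r_y2, r_12 = 0.80, 0.61, 0.75

b1, b2 = two_predictor_betas(r_y1, r_y2, r_12)
s1, s2 = structure_coefficients(r_y1, r_y2, r_12)
print(f"X1: beta = {b1:.3f}, squared structure coefficient = {s1 ** 2:.1%}")
print(f"X2: beta = {b2:.3f}, squared structure coefficient = {s2 ** 2:.1%}")
```

     Under these assumed correlations X2 receives a weight of roughly .02, yet its squared structure coefficient is well over 50%: the variable is useful, but X1 is arbitrarily receiving the credit.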

Case #3: "Suppressor" Effects

     The previous case makes clear that a measured variable assigned a zero or near-zero weight may nevertheless be an important variable, as reflected in the variable having a large non-zero structure coefficient. However, although it may seem counter-intuitive, a measured/observed variable may also have a zero or near-zero structure coefficient, and still be very important in defining a detected effect, as reflected in the variable having a non-zero standardized weight. [That is, only measured variables with both near-zero weights and near-zero structure coefficients are useless in defining a given detected effect.]

     Such a variable is classically termed a "suppressor" variable. However, although the name may feel pejorative, a "suppressor" variable actually increases the effect size, and so suppression is a good (and not a bad) thing. As defined by Pedhazur (1982, p. 104), in the related regression case, "A suppressor variable is a variable that has a zero, or close to zero, correlation with the criterion but is correlated with one or more than one of the predictor variables." Henard (1998) provides a nice overview of suppressor effects.

     Suppressor effects are quite difficult to explain in an intuitive manner. But Horst (1966) gave an example that is relatively accessible. He described the multiple regression prediction of pilot training success during World War II using mechanical, numerical, spatial, and verbal ability scores, each measured with paper and pencil tests. The verbal scores had very low correlations with the dependent variable, but had larger correlations with the other three predictors, since they were all measured with paper and pencil tests, i.e., measurement artifacts inflate correlations among traits measured with similar methods. As Horst (1966, p. 355) noted, "Some verbal ability was necessary in order to understand the instructions and the items used to measure the other three abilities."

     Including verbal ability scores in the regression equation in this example actually served to remove the contaminating influence of one predictor from the other predictors, which effectively increased the R² value from what it would have been if only mechanical, numerical and spatial abilities had been used as predictors. The verbal ability variable had negative beta weights in the equation. As Horst (1966, p. 355) noted, "To include the verbal score with a negative weight served to suppress or subtract irrelevant ability, and to discount the scores [on the other predictors] of those who did well on the test simply because of their verbal ability rather than because of abilities required for success in pilot training." The fact that a measured variable unrelated to a measured criterion variable can still make important contributions in an analysis itself makes the very important point that the latent or synthetic variables analyzed in all parametric methods are always more than the sum of their constituent parts.

     Table 18 presents a relevant heuristic DDA data set for this case involving k=3 groups and p=3 response variables. Table 19 presents an excerpt from the related SPSS analysis of the tabled data. As reported in Table 19, on Function I DDA response variable X3 had a near-zero structure coefficient (rS = -.03464), but a large non-zero standardized function coefficient (i.e., -1.58393). Indeed, on this function X3 had the largest absolute standardized function coefficient, since X1 and X2 had standardized function coefficients of +1.22956 and +1.21174, respectively.

___________________________________

INSERT TABLES 18 AND 19 ABOUT HERE.

___________________________________
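
     Suppression can likewise be sketched with a two-predictor regression. The fragment below is an illustration under stated assumptions, not a reanalysis of the Table 18 data or of Horst's example: the correlations are hypothetical, X2 is given a correlation of exactly zero with the criterion, and the function name is invented here.

```python
from math import sqrt

def two_predictor_betas(r_y1, r_y2, r_12):
    """Standardized weights for a regression with two predictors."""
    denom = 1.0 - r_12 ** 2
    return (r_y1 - r_y2 * r_12) / denom, (r_y2 - r_y1 * r_12) / denom

# Hypothetical correlations: X2 is unrelated to the criterion (r_y2 = 0)
# but shares variance with X1 (r_12 = .50).
r_y1, r_y2, r_12 = 0.50, 0.00, 0.50

b1, b2 = two_predictor_betas(r_y1, r_y2, r_12)
r2_x1_only = r_y1 ** 2                   # effect size without the suppressor
r2_both = b1 * r_y1 + b2 * r_y2          # R^2 = sum of beta * r_yx
structure_x2 = r_y2 / sqrt(r2_both)      # correlation of X2 with predicted scores

print(f"beta1 = {b1:.3f}, beta2 = {b2:.3f}")
print(f"R^2 with X1 only = {r2_x1_only:.3f}, with X1 and X2 = {r2_both:.3f}")
print(f"structure coefficient for X2 = {structure_x2:.3f}")
```

     With these assumed values X2 has a structure coefficient of zero and a weight of about -.33, and adding it raises the squared multiple correlation from .25 to about .33.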

 

 

Error #4: Failing to Recognize that

Reliability Is Not a Characteristic of Tests

Nature of Score Reliability

     Misconceptions regarding the nature of reliability abound within the social sciences. For example, some researchers do not realize that, "Notwithstanding erroneous folkwisdom to the contrary, sometimes scores from shorter tests are more reliable than scores from longer tests" (Thompson, 1990, p. 586). In her important recent article, Vacha-Haase (1998a) cited the example of the Bem Sex-Role Inventory, noting that, "[i]n fact, the 20-item short-form of the Bem generally yields more reliable scores (rXX2 for the feminine scale ranging from .84 to .87) than does the 40-item long-form (rXX2 for the feminine scale ranging from .75 to .78)" (pp. 9-10).

     Misconceptions regarding reliability flourish in part because

[a]lthough most programs in sociobehavioral sciences, especially doctoral programs, require a modicum of exposure to statistics and research design, few seem to require the same where measurement is concerned. Thus, many students get the impression that no special competencies are necessary for the development and use of measures... (Pedhazur & Schmelkin, 1991, pp. 2-3)

Empirical study of doctoral curricula confirms this impression (Aiken et al., 1990).

     The most fundamental problem is that too few researchers act on a conscious recognition that reliability is a characteristic of scores or the data in hand, and not of tests. Test booklets are not impregnated with reliability during the printing process. The WISC that yields reliable scores for some adults on a given occasion of measurement will not necessarily do so when the same test is administered to first-graders.

     Many researchers recognize these dynamics on some level, but unconscious paradigm influences constrain too many researchers from actively integrating this presumption into their actual analytic practice. The pernicious practice of saying, "the test is reliable," creates a language that unconsciously predisposes researchers against acting on a conscious realization that tests themselves are not reliable (Thompson, 1994c). Reinhardt (1996) provides an excellent relevant review of reliability coefficients, and the factors that impact score reliability.

     As Rowley (1976, p. 53, emphasis added) argued, "It needs to be established that an instrument itself is neither reliable nor unreliable.... A single instrument can produce scores which are reliable, and other scores which are unreliable." Similarly, Crocker and Algina (1986, p. 144, emphasis added) argued that, "...A test is not 'reliable' or 'unreliable.' Rather, reliability is a property of the scores on a test for a particular group of examinees."

     In another widely respected text, Gronlund and Linn (1990, p. 78, emphasis in original) noted,

Reliability refers to the results obtained with an evaluation instrument and not to the instrument itself.... Thus, it is more appropriate to speak of the reliability of the "test scores" or of the "measurement" than of the "test" or the "instrument."

     And Eason (1991, p. 84, emphasis added) argued that:

Though some practitioners of the classical measurement paradigm [incorrectly] speak of reliability as a characteristic of tests, in fact reliability is a characteristic of data, albeit data generated on a given measure administered with a given protocol to given subjects on given occasions.

     The subjects themselves impact the reliability of scores, and thus it becomes an oxymoron to speak of "the reliability of the test" without considering to whom the test was administered, or other facets of each individual measurement protocol. Reliability is driven by variance--typically, greater score variance leads to greater score reliability, and so more heterogeneous samples often lead to more variable scores, and thus to higher reliability. Therefore, the same measure, when administered to more heterogeneous or to more homogeneous sets of subjects, will yield scores with differing reliability. As Dawis (1987, p. 486) observed, "[b]ecause reliability is a function of sample as well as of instrument, it should be evaluated on a sample from the intended target population--an obvious but sometimes overlooked point."
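
     The role of sample variance follows from the classical test theory definition of reliability as the ratio of true-score variance to observed-score variance. The sketch below uses hypothetical standard deviations and assumes, for illustration only, that the same instrument produces the same amount of measurement error in both samples.

```python
def score_reliability(true_sd, error_sd):
    """Classical test theory: true-score variance / observed-score variance."""
    true_var, error_var = true_sd ** 2, error_sd ** 2
    return true_var / (true_var + error_var)

error_sd = 5.0  # assumed constant: same instrument, same protocol
samples = [("homogeneous sample", 5.0), ("heterogeneous sample", 15.0)]

for label, true_sd in samples:
    print(label, round(score_reliability(true_sd, error_sd), 2))
```

     Under these assumptions the identical instrument yields scores with a reliability of .50 in the homogeneous sample and .90 in the heterogeneous sample.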

     Our shorthand ways of speaking (e.g., language saying "the test is reliable") can themselves cause confusion and lead to bad practice. As Pedhazur and Schmelkin (1991, p. 82, emphasis in original) observed, "Statements about the reliability of a measure are... inappropriate and potentially misleading." These telegraphic ways of speaking are not inherently problematic, but they often later become so when we come unconsciously to ascribe literal truth to our shorthand, rather than recognizing that our jargon is merely telegraphic and is not literally true. As noted elsewhere:

This is not just an issue of sloppy speaking--the problem is that sometimes we unconsciously come to think what we say or what we hear, so that sloppy speaking does sometimes lead to a more pernicious outcome, sloppy thinking and sloppy practice. (Thompson, 1992c, p. 436)

Implications for Practice

     These views suggest at least three implications for research practice. These practices are, unfortunately, not yet normative within the social sciences.

     Language Use. One fairly straightforward recommendation is that researchers should not use language saying that, "the test is reliable [or valid]," or that, "the reliability [or validity] of the test was .xx." Because on its face this language is inaccurate, and asserts untruth, it seems imprudent to use such language in scholarly discourse. The editorial policies of at least one journal commend better, correct practices:

Based on these considerations, use of wording such as "the reliability of the test" or "the validity of the test" will not be considered acceptable in the journal. Instead, authors should use language such as, "the scores in our study had a classical theory test-retest reliability coefficient of X," or "based on generalizability theory analysis, the scores in our study had a phi coefficient of X." Use of technically correct language will hopefully reinforce better practice. (Thompson, 1994c, p. 841)

     Coefficient Reporting. Researchers also ought to routinely report the reliability coefficients for their own data. Many do not do so now, because they act under the pernicious misconception that tests are reliable, and are therefore invariant across administrations.

     But it is sloppy practice to not calculate, report, and interpret the reliability of one's own scores for one's own data. As Pedhazur and Schmelkin (1991, p. 86, emphasis in original) argued:

Researchers who bother at all to report reliability estimates for the instruments they use (many do not) frequently report only reliability estimates contained in the manuals of the instruments or estimates reported by other researchers. Such information may be useful for comparative purposes, but it is imperative to recognize that the relevant reliability estimate is the one obtained for the sample used in the [present] study under consideration.

Unhappily, empirical studies indicate that such reports are infrequent (Meier & Davis, 1990; Willson, 1980) in most journals, although there are exceptions (Thompson & Snyder, in press).

     In her important paper proposing "reliability generalization" methods to characterize (a) the mean and (b) the standard deviation of score reliabilities for a given instrument across studies, and to explore (c) the sources of variability in score reliabilities, Vacha-Haase noted a benefit from the routine reporting of score reliability even in substantive studies:

Furthermore, if authors of empirical studies routinely report reliability coefficients, even in substantive studies, the field will cumulate more evidence regarding the psychometric integrity of scores. Such practices would provide more fodder for reliability generalization analyses focusing upon the differential influences of various sources of measurement error. (Vacha-Haase, 1998a, p. 14)

     Interpret Results in a Reliability Context. Effect sizes can and should be computed in all studies; Kirk (1996) and Snyder and Lawson (1993) provide excellent reviews of the many options. When and if these effects are deemed (a) noteworthy in magnitude and (b) replicable, then (and only then) these effect sizes should also be interpreted.

     Score reliability is one of the several study features that impact detected effects. Score measurement errors always attenuate computed effects to some degree (Schneider & Darcy, 1984). This attenuation ought to be considered when interpreting reported effects. As I have noted elsewhere,

The failure to consider score reliability in substantive research may exact a toll on the interpretations within research studies. For example, we may conduct studies that could not possibly yield noteworthy effect sizes, given that score reliability inherently attenuates effect sizes. Or we may not accurately interpret the effect sizes in our studies if we do not consider the reliability of the scores we are actually analyzing. (Thompson, 1994c, p. 840)
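
     The size of the toll can be approximated with the classical attenuation relationship from measurement theory, in which the expected observed correlation equals the correlation between true scores multiplied by the square root of the product of the two score reliabilities. The paper does not spell this formula out; the sketch below, with hypothetical input values and an invented function name, is offered only to make the argument concrete.

```python
from math import sqrt

def attenuated_r(true_r, rel_x, rel_y):
    """Expected observed correlation given the two score reliabilities."""
    return true_r * sqrt(rel_x * rel_y)

# Hypothetical values: a true-score correlation of .50, measured with
# scores having reliabilities of .64 and .81.
observed = attenuated_r(0.50, 0.64, 0.81)
print(f"expected observed r   = {observed:.2f}")
print(f"expected observed r^2 = {observed ** 2:.3f} (versus {0.50 ** 2:.3f})")
```

     With these assumed reliabilities a true-score correlation of .50 is expected to appear as an observed correlation of .36, so the variance-accounted-for effect size shrinks from 25% to roughly 13%.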

Error #5: Incorrectly Interpreting Statistical Significance;

Failing to Report Effect Sizes

     As Pedhazur and Schmelkin (1991) noted, "probably very few methodological issues have generated as much controversy" (p. 198) as have the use and interpretation of statistical significance tests. These tests have proven surprisingly resistant to repeated efforts "to exorcise the null hypothesis" (Cronbach, 1975, p. 124). Especially noteworthy among the historical efforts to accomplish the exorcism have been works by Rozeboom (1960), Morrison and Henkel (1970), Carver (1978), Meehl (1978), Shaver (1985), and Oakes (1986).

     More recently, a seemingly periodic series of articles on the extraordinary limits of statistical significance tests has been published in the American Psychologist (cf. Cohen, 1990, 1994; Kupfersmid, 1988; Rosenthal, 1991; Rosnow & Rosenthal, 1989). The entire Volume 61, Number 4 issue of the Journal of Experimental Education was devoted to these themes. Schmidt's (1996) APA Division 5 presidential address was published as the lead article in the second issue of the inaugural volume of the new APA journal, Psychological Methods. The lead section (cf. Hunter, 1997) of the January, 1997 issue of Psychological Science was devoted to this controversy. The April, 1998 issue of Educational and Psychological Measurement featured two lengthy reviews (Levin, 1998; Thompson, 1998) of a major text (Harlow, Mulaik & Steiger, 1997) on the controversy. And the APA Task Force on Statistical Inference (Shea, 1996) has now been working for nearly two years on related recommendations for improving practices.

     Illustrative condemnations of contemporary statistical testing practices can be noted. For example, Schmidt and Hunter (1997) recently argued that "Statistical significance testing retards the growth of scientific knowledge; it never makes a positive contribution" (p. 37). Rozeboom (1997) was equally direct:

Null-hypothesis significance testing is surely the most bone-headedly misguided procedure ever institutionalized in the rote training of science students... [I]t is a sociology-of-science wonderment that this statistical practice has remained so unresponsive to criticism... (p. 335)

     But, without much question, two articles by the late Jacob Cohen (1990, 1994) have been the most influential. Roger Kirk (1996) characterized the two American Psychologist articles by Cohen as "classics," and argued that "the one individual most responsible for bringing the shortcomings of hypothesis testing to the attention of behavioral and educational researchers is Jacob Cohen" (p. 747).

     This onslaught of criticism has provoked reactive advocacy for statistical tests (cf. Cortina & Dunlap, 1997; Frick, 1996; Greenwald, Gonzalez, Harris & Guthrie, 1996; Hagen, 1997; Robinson & Levin, 1997). Some of these treatments have been thoughtful, but others have been seriously flawed (see Thompson, in press-c, in press-d).

     Yet, notwithstanding the long-term availability of these many publications, even today some researchers still do not understand what their statistical significance tests do and do not do. Empirical studies of researcher perceptions of test results confirm that researchers manifest these misconceptions (cf. Nelson, Rosenthal & Rosnow, 1986; Oakes, 1986; Rosenthal & Gaito, 1963; Zuckerman, Hodgins, Zuckerman & Rosenthal, 1993). Similarly, content reviews of the most widely-used statistics textbooks show that even our most distinguished methodologists do not have a good grasp on the meaning of statistical significance tests (Carver, 1978).

     My own views have been articulated in various locations (e.g., Thompson, 1993, 1994d, 1997a, in press-a, in press-d). I believe that three other essays (Thompson, 1996, 1998, in press-b) are particularly noteworthy. And a short, public-domain ERIC Digest I published (Thompson, 1994b) may be very useful as a class handout.

     I have never argued that significance tests should be banned, though obviously others have argued that view (cf. Carver, 1978; Schmidt & Hunter, 1997). As an author, I do report (without much excitement) the results of statistical significance tests. As an editor of three journals, I have accepted for publication manuscripts that report these tests.

Common Misconceptions Regarding Statistical Tests

     In various locations I have criticized common misconceptions regarding the meaning and value of statistical tests (cf. Thompson, 1996, in press-b). Three of these I now briefly summarize here.

     Statistical Significance Does Not Test Result Importance. Put simply, improbable events are not intrinsically interesting. Some highly improbable events, in fact, are completely inconsequential. In his classic hypothetical dialogue between two teachers, Shaver (1985, p. 58) poignantly illustrated the folly of equating result improbability with result importance:

Chris: ...I set the level of significance at .05, as my advisor suggested. So a difference that large would occur by chance less than five times in a hundred if the groups weren't really different. An unlikely occurrence like that surely must be important.

Jean: Wait a minute, Chris. Remember the other day when you went into the office to call home? Just as you completed dialing the number, your little boy picked up the phone to call someone. So you were connected and talking to one another without the phone ever ringing... Well, that must have been a truly important occurrence then?

     Even more importantly, since the premises of statistical significance tests do not invoke human values, in valid logical argument statistical results therefore cannot under any circumstances contain as part of their conclusions information about result value. As I have noted previously, "If the computer package did not ask you your values prior to its analysis, it could not have considered your value system in calculating p's, and so p's cannot be blithely used to infer the value of research results" (Thompson, 1993, p. 365). Thus, statistical tests cannot reasonably be used as an atavistic escape from responsibility for defending result importance (Thompson, 1993), or to maintain a mantle of feigned objectivity (Thompson, in press-b).

     Statistical Significance Does Not Test Result Replicability. Social scientists seek to identify relationships that recur under stated conditions. Discovering analogs of cold fusion will make us extremely popular (free drinks, much dancing, etc.) at our next scholarly meeting, but we will eternally thereafter be shunned (no one will accept the drinks we attempt to buy for them, so much for the dancing, etc.) at all future conferences, once our results are discovered to be non-replicable. [So, only report non-replicable results at your last conference, immediately prior to retirement.]

     Too many researchers, consciously or unconsciously, incorrectly assume that the p values calculated in statistical significance tests evaluate the probability that results will replicate (Carver, 1978, 1993). But statistical tests do not evaluate the probability that the sample statistics occur in the population as parameters (Cohen, 1994).

     Instead, "pCALCULATED is the probability (0 to 1.0) of the sample statistics, given the sample size, and assuming the sample was derived from a population in which the null hypothesis (H0) is exactly true" (Thompson, 1996, p. 27). Obviously, knowing the probability of the sample is less interesting than knowing the probability of the population. Knowing the probability of population parameters would bear upon result replicability, since we would then know something about the population from which future researchers would also draw their samples.

     But as Shaver (1993) argued so emphatically:

[A] test of statistical significance is not an indication of the probability that a result would be obtained upon replication of the study.... Carver's (1978) treatment should have dealt a death blow to this fallacy.... (p. 304)

And so Cohen (1994) concluded that the statistical significance test "does not tell us what we want to know, and we so much want to know what we want to know that, out of desperation, we nevertheless believe that it does!" (p. 997).

     Statistical Significance Does Not Solely Evaluate Effect Magnitude. Because various study features (including score reliability) impact calculated p values, pCALCULATED cannot be used as a satisfactory index of study effect size. As I have noted elsewhere,

The calculated p values in a given study are a function of several study features, but are particularly influenced by the confounded, joint influence of study sample size and study effect sizes. Because p values are confounded indices, in theory 100 studies with varying sample sizes and 100 different effect sizes could each have the same single pCALCULATED, and 100 studies with the same single effect size could each have 100 different values for pCALCULATED. (Thompson, in press-b)
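
     The confounding can be sketched for a simple bivariate correlation. The fragment below is an approximation under stated assumptions rather than an exact test: it uses the Fisher z transformation with a normal reference distribution (not the t test a statistics package would report), the sample sizes and the effect size of r = .20 are hypothetical, and the function names are invented for illustration.

```python
from math import atanh, tanh, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()

def p_for_r(r, n):
    """Approximate two-tailed p for H0: rho = 0, via Fisher's z."""
    z = atanh(r) * sqrt(n - 3)
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))

def r_for_p(p, n):
    """Correlation that yields the requested two-tailed p at sample size n."""
    z = STD_NORMAL.inv_cdf(1 - p / 2)
    return tanh(z / sqrt(n - 3))

# One effect size, several sample sizes: several different p values.
for n in (20, 50, 200):
    print(f"r = .20, n = {n:3d}: p = {p_for_r(0.20, n):.3f}")

# Several effect sizes and sample sizes: one and the same p value.
for n in (20, 50, 200):
    print(f"n = {n:3d}: r = {r_for_p(0.05, n):.3f} yields p = .05")
```

     The first loop holds the effect size constant and lets the sample size alone drive p downward; the second finds, for each sample size, a different effect size that produces the identical p value.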

     The recent fourth edition of the American Psychological Association style manual (APA, 1994) explicitly acknowledges that p values are not acceptable indices of effect:

Neither of the two types of probability values [statistical significance tests] reflects the importance or magnitude of an effect because both depend on sample size... You are [therefore] encouraged to provide effect-size information. (APA, 1994, p. 18, emphasis added)

Recommended Improvements in Statistical Testing Practices

     In various locations (cf. Thompson, 1996, in press-b) I have advocated certain changed practices as regards the use of statistical tests. Five such suggested changes are now summarized here.

     Effect Sizes Should Be Reported for All Tested Effects. The single most important potential improvement in analytic practice would be the regular and routine reporting of effect sizes in all studies. As noted previously, such reports are at least "encouraged" by the new APA (1994, p. 18) style manual.

     However, empirical studies of articles published since 1994 in psychology, counseling, special education, and general education suggest that merely "encouraging" effect size reporting (APA, 1994) has not appreciably affected actual reporting practices (e.g., Kirk, 1996; Snyder & Thompson, in press; Thompson & Snyder, 1997, in press; Vacha-Haase & Nilson, in press). An on-going series of additional empirical studies of reporting practices has yielded similar results for yet more journals (Lance & Vacha-Haase, 1998; Ness & Vacha-Haase, 1998; Nillson & Vacha-Haase, 1998; Reetz & Vacha-Haase, 1998).

     Effect sizes are important to report for at least two reasons. First, when these effects are noteworthy, these indices inform judgment regarding the practical or substantive significance of results (cf. Kirk, 1996). Second, reporting all effect sizes (even non-statistically significant effects, though some might not interpret them) facilitates the meta-analytic integration of findings across a given literature.

     There are many effect sizes (e.g., "uncorrected," "corrected," standardized differences) that can be computed (cf. Kirk, 1996; Snyder & Lawson, 1993). In my view (Thompson, in press-b), arguments can be made that certain indices should be preferred over others. But the important point is that, as regards effect size reporting, it is generally better to report anything as against nothing, which is the effect size that most researchers currently report.
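As a rough illustration of the three families of indices named above, the following sketch computes an "uncorrected" variance-accounted-for statistic (eta squared), a "corrected" one (omega squared), and a standardized difference (Cohen's d with a pooled standard deviation) for a one-way design. The data and the function names are hypothetical, and the formulas are the common textbook forms rather than anything prescribed in this paper.

```python
import math
from statistics import mean

def anova_effect_sizes(groups):
    """Return (eta squared, omega squared) for a one-way layout."""
    scores = [x for g in groups for x in g]
    grand_mean = mean(scores)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    ss_total = ss_between + ss_within
    k, n = len(groups), len(scores)
    ms_within = ss_within / (n - k)
    eta2 = ss_between / ss_total                      # uncorrected
    omega2 = (ss_between - (k - 1) * ms_within) / (ss_total + ms_within)  # corrected
    return eta2, omega2

def cohens_d(a, b):
    """Standardized mean difference (b minus a) using the pooled SD."""
    ss_a = sum((x - mean(a)) ** 2 for x in a)
    ss_b = sum((x - mean(b)) ** 2 for x in b)
    pooled_sd = math.sqrt((ss_a + ss_b) / (len(a) + len(b) - 2))
    return (mean(b) - mean(a)) / pooled_sd

# Hypothetical scores for two small groups.
control = [4, 5, 6, 7, 8]
treatment = [6, 7, 8, 9, 10]

eta2, omega2 = anova_effect_sizes([control, treatment])
print(f"eta squared   = {eta2:.3f}")
print(f"omega squared = {omega2:.3f}")
print(f"Cohen's d     = {cohens_d(control, treatment):.3f}")
```

With small samples such as these, the corrected estimate is noticeably smaller than the uncorrected one, which is one reason the choice of index can matter.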

     Of course, an effect size is no more magical than is statistical significance testing, for the two reasons noted by Zwick (1997). First, because human values are also not part of the calculation of an effect size, any more than values are part of the calculation of p, "largeness of effect does not guarantee practical importance any more than statistical significance does" (p. 4).

     Second, some researchers have too rigidly adopted Cohen's (1988) definitions of small, medium and large effects, just as some researchers too rigidly adopted "α = .05" as their gold standard. Cohen (1988) only intended these as impressionistic characterizations of result typicality across a diverse literature. However, some empirical studies do suggest that the characterization is reasonably accurate (Glass, 1979; Olejnik, 1984), at least as regards a literature historically built with a bias against statistically non-significant results (Rosenthal, 1979).

     In my view, editorial requirements (Vacha-Haase, 1998b) will ultimately be needed to move the field to change analytic and reporting practices. Fortunately, editorial policies at some journals now require authors to report and interpret effect sizes. For example, the author guidelines of the Journal of Experimental Education indicate that "authors are required to report and interpret magnitude-of-effect measures in conjunction with every p value that is reported" (Heldref Foundation, 1997, pp. 95-96, emphasis added). I believe the Educational and Psychological Measurement (EPM) author guidelines are equally informed:

We will go further [than mere encouragement]. Authors reporting statistical significance will be required to both report and interpret effect sizes. However, these effect sizes may be of various forms, including standardized differences, or uncorrected (e.g., r², R², eta²) or corrected (e.g., adjusted R², omega²) variance-accounted-for statistics. (Thompson, 1994c, p. 845, emphasis in original)

It is particularly noteworthy that editorial policies even at one APA journal now indicate that:

If an author decides not to present an effect size estimate along with the outcome of a significance test, I will ask the author to provide specific justification for why effect sizes are not reported. So far, I have not heard a good argument against presenting effect sizes. Therefore, unless there is a real impediment to doing so, you should routinely include effect size information in the papers you submit. (Murphy, 1997, p. 4)

     Researchers Should More Frequently Employ Non-Nil Nulls. An important but overlooked (see Hagen, 1997; Thompson, in press-c) element of Cohen's (1994) classic article involved his striking criticism of the routine use of "nil" null hypotheses. Cohen (1994) defined a "nil" null hypothesis as a null specifying no differences (e.g., SD1 - SD2 = 0) or zero correlations (e.g., R² = 0).

     Some researchers employ nil nulls because statistical theory does not easily accommodate the testing of some non-nil nulls. But in other cases researchers employ nil nulls because these nulls have been unconsciously accepted as traditional, because these nulls can be mindlessly formulated without consulting previous literature, or because most computer software defaults to tests of nil nulls (Thompson, 1998, in press-b, in press-c).

     Unfortunately, when a statistical significance test presumes a nil null is true in the population, an untruth is posited. As Meehl (1978, p. 822) noted, "As I believe is generally recognized by statisticians today and by thoughtful social scientists, the null hypothesis, taken literally, is always false." Similarly, Hays (1981, p. 293) pointed out that "[t]here is surely nothing on earth that is completely independent of anything else [in the population]. The strength of association may approach zero, but it should seldom or never be exactly zero."

     Highly respected statistician Roger Kirk (1996) put the point succinctly in his important recent article:

Because the null hypothesis is always false, a decision to reject it simply indicates that the research design had adequate power to detect a true state of affairs, which may or may not be a large effect or even a useful effect. It is ironic that a ritualistic adherence to null hypothesis significance testing has led researchers to focus on controlling the Type I error that cannot occur because all null hypotheses are false. (p. 747, emphasis added)

And a pCALCULATED value computed on the foundation of a false premise is inherently of somewhat limited utility.

     There is a very important implication of the realization that the nil null is untrue in the population. As Hays (1981, p. 293) emphasized, because the nil null is untrue in the population, sample statistics should reflect some difference or some effect, and thus "virtually any study can be made to show significant results if one uses enough subjects." This means that

Statistical significance testing can involve a tautological logic in which tired researchers, having collected data from hundreds of subjects, then conduct a statistical test to evaluate whether there were a lot of subjects, which the researchers already know, because they collected the data and know they're tired. (Thompson, 1992c, p. 436)

Statistical significance would be considerably more informative if researchers reviewed relevant previous research, and then constructed hypotheses that incorporated previous results.
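One way such a hypothesis might be tested is sketched below for a correlation coefficient. The sketch assumes Fisher's z approximation, a hypothetical new sample result (r = .45, n = 100), and a hypothetical value of .30 taken from previous studies. The function name `p_against_null` is likewise only illustrative.

```python
import math
from statistics import NormalDist

def p_against_null(r, n, rho0):
    """Approximate two-sided p for a sample correlation r (sample size n),
    assuming the population correlation equals rho0.
    rho0 = 0 is the nil null; any other value is a non-nil null."""
    z = (math.atanh(r) - math.atanh(rho0)) * math.sqrt(n - 3)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

r_sample, n = 0.45, 100   # hypothetical result from a new study
rho_prior = 0.30          # hypothetical value suggested by earlier studies

p_nil = p_against_null(r_sample, n, 0.0)
p_non_nil = p_against_null(r_sample, n, rho_prior)

print(f"nil null     (rho = .00): p = {p_nil:.6f}")
print(f"non-nil null (rho = .30): p = {p_non_nil:.4f}")
```

In this hypothetical case the nil null is rejected easily, which says little, whereas the test against the prior value addresses the more useful question of whether the new result departs from what the literature already suggested.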

     Measurement Results Should Be Tested with Non-Nil Nulls. There is growing recognition that some uses of statistical tests in measurement studies, as regards reliability or validity coefficients or construct validity tests of means, can be particularly misguided. For example, Abelson (1997) commented on statistical tests of measurement study results using nil null hypotheses:

And when a reliability coefficient is declared to be nonzero, that is the ultimate in stupefyingly vacuous information. What we really want to know is whether an estimated reliability is .50'ish or .80'ish. (Abelson, 1997, p. 121)

Fortunately, the author guidelines of some journals have become more enlightened as regards such practices:

Statistical tests of such coefficients in a measurement context make little sense. Either statistical significance tests using the [nil] null hypothesis of zero magnitude should be by-passed, or meaningful null hypotheses should be employed. (Thompson, 1994c, p. 844)

     Researchers Should Provide Some Warrant That Results Are Replicable. Because evidence of result replicability is important (if we take science to be the business of cumulating knowledge across studies), and because statistical significance tests do not evaluate result replicability (Cohen, 1994; Thompson, 1996, 1997b), other methods must and should be used for this purpose. It has been suggested that

As more researchers finally realize that statistical significance tests do not test the population, and therefore do not test replicability, researchers will increasingly emphasize evidence that instead is relevant to the issue of result replicability. (Vacha-Haase & Thompson, in press)

Many warrants are available, and in fact a single study might present several such warrants.

     The most persuasive, and perhaps the only conclusive, evidence for result replicability is to actually replicate the study. And replication studies are important, and probably are somewhat undervalued in the social sciences (Robinson & Levin, 1997). However, many researchers (especially doctoral students working on dissertations and junior faculty seeking tenure) find themselves unable to replicate every study.

     One potential warrant for replicability would involve prospectively formulating null hypotheses by reflectively consulting the effect sizes reported in previous related studies, and by prospectively interpreting study effects in the context of specific previous findings. In effect, virtually any study might be conducted and interpreted as a partial replication of previous inquiry. Another alternative warrant involves empirical investigation of replicability by conducting what I have termed (cf. Thompson, 1996) "internal" replicability analyses.

     "Internal" replicability analyses empirically use the sample in hand to combine the participants in different ways to estimate how much the idiosyncrasies of individuality within the sample have compromised generalizability. The major "internal" empirical replicability analyses are cross-validation, the jackknife, and the bootstrap (Diaconis & Efron, 1983); the logics are reviewed in more detail elsewhere (cf. Thompson, 1993, 1994d). "Internal" evidence for replicability is never as good as an actual replication (Robinson & Levin, 1997; Thompson, 1997a), but is certainly better than incorrectly presuming that statistical significance assures result replicability.

     However, it must be emphasized that the inferential and the descriptive uses of these logics should not be confused (Thompson, 1993). For example, the inferential use of the bootstrap involves using the bootstrap to estimate a sampling distribution when the sampling distribution is not known or assumptions for the use of a known sampling distribution cannot be met (i.e., to conduct a different form of statistical significance test). The descriptive use of the bootstrap looks primarily at the variability in effect sizes or other parameter estimates across many different combinations of the participants. The software to conduct "internal" bootstrap analyses for statistics commonly used in the social sciences (cf. Elmore & Woehlke, 1988; Goodwin & Goodwin, 1985) is already widely available (e.g., Lunneborg (1987) for univariate applications, and Thompson (1988b, 1992a, 1995a) for multivariate applications).
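The descriptive use of the bootstrap can be sketched in a few lines. The example below is not the cited software. It is a minimal illustration that assumes a Pearson r as the effect size, hypothetical paired scores, 2,000 resamples, and a fixed random seed so that the run is repeatable.

```python
import math
import random
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def bootstrap_r(xs, ys, n_resamples=2000, seed=1998):
    """Resample participants (x, y pairs) with replacement and
    recompute r for each resample."""
    rng = random.Random(seed)
    pairs = list(zip(xs, ys))
    estimates = []
    while len(estimates) < n_resamples:
        sample = [rng.choice(pairs) for _ in pairs]
        bx, by = zip(*sample)
        # Skip degenerate resamples in which a variable has no variance.
        if len(set(bx)) > 1 and len(set(by)) > 1:
            estimates.append(pearson_r(bx, by))
    return estimates

# Hypothetical scores for 12 participants.
x = [2, 4, 5, 7, 8, 10, 11, 13, 14, 16, 17, 19]
y = [3, 5, 4, 8, 7, 9, 12, 11, 15, 14, 18, 17]

estimates = sorted(bootstrap_r(x, y))
lo = estimates[int(0.025 * len(estimates))]
hi = estimates[int(0.975 * len(estimates)) - 1]

print(f"sample r               = {pearson_r(x, y):.3f}")
print(f"mean of resampled r    = {mean(estimates):.3f}")
print(f"SD of resampled r      = {stdev(estimates):.3f}")
print(f"middle 95% of resamples: {lo:.3f} to {hi:.3f}")
```

The quantity of interest here is the spread of the resampled effect sizes, not a p value. A narrow spread suggests that the estimate does not depend heavily on which participants happened to be in the sample.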

     Improved Language Use. In Thompson (1996), I suggested that when the null hypothesis is rejected, "such results ought to always be described as 'statistically significant,' and should never be described only as 'significant'" (pp. 28-29). My argument (Thompson, 1996, 1997a; but see Robinson & Levin, 1997) has been that the common meaning of "significant" has nothing to do with the statistical use of this term, and that the use of the complete phrase might help, at least somewhat, to convey that this technical phrase has nothing to do with result importance.

     Carver (1993) eloquently made the same argument:

When trying to emulate the best principles of science, it seems important to say what we mean and to mean what we say. Even though many readers of scientific journals know that the word significant is supposed to mean statistically significant when it is used in this context, many readers do not know this. Why be unnecessarily confusing when clarity should be most important? (p. 288, emphasis in original)

Summary

     After presenting a general linear model as a framework for discussion, the present paper reviewed five methodology errors that occur in educational research: (a) the use of stepwise methods; (b) the failure to consider in result interpretation the context specificity of analytic weights (e.g., regression beta weights, factor pattern coefficients, discriminant function coefficients, canonical function coefficients) that are part of all parametric quantitative analyses; (c) the failure to interpret both weights and structure coefficients as part of result interpretation; (d) the failure to recognize that reliability is a characteristic of scores, and not of tests; and (e) the incorrect interpretation of statistical significance and the related failure to report and interpret the effect sizes present in all quantitative analyses. In several cases small heuristic discriminant analysis data sets were presented to make more concrete and accessible the discussion of each of these five methodology errors.

     However, of the various arenas for improvement, the one where I believe the most progress could be realized involves the use of statistical significance tests and the reporting of effect sizes. Yet this is where the most resistance has seemingly occurred. For example, Schmidt and Hunter (1997) recently argued that "logic-based arguments seem to have had only a limited impact... [perhaps due to] the virtual brainwashing in significance testing that all of us have undergone" (pp. 38-39). They also spoke of a "psychology of addiction to significance testing" (Schmidt & Hunter, 1997, p. 49).

     Journal editor Loftus (1994), like others, has lamented that repeated publications of

these concerns never seem to attract much attention (much less impel action). They are carefully crafted and put forth for consideration, only to just kind of dissolve away in the vast acid bath of our existing methodological orthodoxy. (p. 1)

Another editor commented: "p values are like mosquitos" that apparently "have an evolutionary niche somewhere and [unfortunately] no amount of scratching, swatting or spraying will dislodge them" (Campbell, 1982, p. 698).

     Similar comments have been made by non-editors. For example, Falk and Greenbaum (1995) noted that "A massive educational effort is required to... extinguish the mindless use of a procedure that dies hard" (p. 94). And Harris (1991) observed, "it is surprising that the dragon will not stay dead" (p. 375).

     Fortunately, some progress, however glacial, was reflected in the APA (1994, p. 18) style manual "encouraging" the reporting of effect sizes. But enlightened editorial policies (e.g., Heldref Foundation, 1997; Murphy, 1997; Thompson, 1994c) now provide the strongest basis for cautious optimism.


References

Abelson, R.P. (1997). A retrospective on the significance test ban of 1999 (If there were no significance tests, they would be invented). In L.L. Harlow, S.A. Mulaik & J.H. Steiger (Eds.), What if there were no significance tests? (pp. 117-141). Mahwah, NJ: Erlbaum.

Aiken, L.S., West, S.G., Sechrest, L., Reno, R.R., with Roediger, H.L., Scarr, S., Kazdin, A.E., & Sherman, S.J. (1990). The training in statistics, methodology, and measurement in psychology. American Psychologist, 45, 721-734.

American Psychological Association. (1994). Publication manual of the American Psychological Association (4th ed.). Washington, DC: Author.

Bagozzi, R.P. (1981). Canonical correlation analysis as a special case of a structural relations model. Multivariate Behavioral Research, 16, 437-454.

Borgen, F.H., & Seling, M.J. (1978). Uses of discriminant analysis following MANOVA: Multivariate statistics for multivariate purposes. Journal of Applied Psychology, 63(6), 689-697.

Campbell, N. (1982). Editorial: Some remarks from the outgoing editor. Journal of Applied Psychology, 67, 691-700.

Carver, R. (1978). The case against statistical significance testing. Harvard Educational Review, 48, 378-399.

Carver, R. (1993). The case against statistical significance testing, revisited. Journal of Experimental Education, 61, 287-292.

Cliff, N. (1987). Analyzing multivariate data. San Diego: Harcourt Brace Jovanovich.

Cohen, J. (1968). Multiple regression as a general data-analytic system. Psychological Bulletin, 70, 426-443.

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.

Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45, 1304-1312.

Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997-1003.

Cooley, W.W., & Lohnes, P.R. (1971). Multivariate data analysis. New York: John Wiley & Sons.

Cortina, J.M., & Dunlap, W.P. (1997). Logic and purpose of significance testing. Psychological Methods, 2, 161-172.

Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. New York: Holt, Rinehart and Winston.

Cronbach, L.J. (1975). Beyond the two disciplines of psychology. American Psychologist, 30, 116-127.

Dawis, R.V. (1987). Scale construction. Journal of Counseling Psychology, 34, 481-489.

Diaconis, P., & Efron, B. (1983). Computer-intensive methods in statistics. Scientific American, 248(5), 116-130.

* Cited empirical studies of methodological practice are designated with asterisks.

Duncan, O.D. (1975). Introduction to structural equation models. New York: Academic Press.

Eason, S. (1991). Why generalizability theory yields better results than classical test theory: A primer with concrete examples. In B. Thompson (Ed.), Advances in educational research: Substantive findings, methodological developments (Vol. 1, pp. 83-98). Greenwich, CT: JAI Press.

Elmore, P.B., & Woehlke, P.L. (1988). Statistical methods employed in American Educational Research Journal, Educational Researcher, and Review of Educational Research from 1978 to 1987. Educational Researcher, 17(9), 19-20.

*Emmons, N.J., Stallings, W.M., & Layne, B.H. (1990, April). Statistical methods used in American Educational Research Journal, Journal of Educational Psychology, and Sociology of Education from 1972 through 1987. Paper presented at the annual meeting of the American Educational Research Association, Boston, MA. (ERIC Document Reproduction Service No. ED 319 797)

Falk, R., & Greenbaum, C.W. (1995). Significance tests die hard: The amazing persistence of a probabilistic misconception. Theory & Psychology, 5(1), 75-98.

Fan, X. (1996). Canonical correlation analysis as a general analytic model. In B. Thompson (Ed.), Advances in social science methodology (Vol. 4, pp. 71-94). Greenwich, CT: JAI Press.

Fan, X. (1997). Canonical correlation analysis and structural equation modeling: What do they have in common? Structural Equation Modeling, 4, 65-79.

Fish, L.J. (1988). Why multivariate methods are usually vital. Measurement and Evaluation in Counseling and Development, 21, 130-137.

Frick, R.W. (1996). The appropriate use of null hypothesis testing. Psychological Methods, 1, 379-390.

Gage, N.L. (1985). Hard gains in the soft sciences: The case of pedagogy. Bloomington, IN: Phi Delta Kappa Center on Evaluation, Development, and Research.

Gall, M.D., Borg, W.R., & Gall, J.P. (1996). Educational research: An introduction (6th ed.). White Plains, NY: Longman.

*Glass, G.V (1979). Policy for the unpredictable (uncertainty research and policy). Educational Researcher, 8(9), 12-14.

Goodwin, L.D., & Goodwin, W.L. (1985). Statistical techniques in AERJ articles, 1979-1983: The preparation of graduate students to read the educational research literature. Educational Researcher, 14(2), 5-11.

Gorsuch, R.L. (1983). Factor analysis (2nd ed.). Hillsdale, NJ: Erlbaum.

Greenwald, A.G., Gonzalez, R., Harris, R.J., & Guthrie, D. (1996). Effect size and p-values: What should be reported and what should be replicated? Psychophysiology, 33(2), 175-183.

Grimm, L.G., & Yarnold, P.R. (Eds.). (1995). Reading and understanding multivariate statistics. Washington, DC: American Psychological Association.

Gronlund, N.E., & Linn, R.L. (1990). Measurement and evaluation in teaching (6th ed.). New York: Macmillan.

Hagen, R.L. (1997). In praise of the null hypothesis statistical test. American Psychologist, 52, 15-24.

*Hall, B.W., Ward, A.W., & Comer, C.B. (1988). Published educational research: An empirical study of its quality. Journal of Educational Research, 81, 182-189.

Harlow, L.L., Mulaik, S.A., & Steiger, J.H. (Eds.). (1997). What if there were no significance tests? Mahwah, NJ: Erlbaum.

Harris, M.J. (1991). Significance tests are not enough: The role of effect-size estimation in theory corroboration. Theory & Psychology, 1, 375-382.

Harris, R.J. (1989). A canonical cautionary. Multivariate Behavioral Research, 24, 17-39.

Hays, W.L. (1981). Statistics (3rd ed.). New York: Holt, Rinehart and Winston.

Heldref Foundation. (1997). Guidelines for contributors. Journal of Experimental Education, 65, 95-96.

Henard, D.H. (1998, January). Suppressor variable effects: Toward understanding an elusive data dynamic. Paper presented at the annual meeting of the Southwest Educational Research Association, Houston. (ERIC Document Reproduction Service No. ED forthcoming)

Holzinger, K.L., & Swineford, F. (1939). A study in factor analysis: The stability of a bi-factor solution (No. 48). Chicago: University of Chicago.

Horst, P. (1966). Psychological measurement and prediction. Belmont, CA: Wadsworth.

Huberty, C.J (1989). Problems with stepwise methods--better alternatives. In B. Thompson (Ed.), Advances in social science methodology (Vol. 1, pp. 43-70). Greenwich, CT: JAI Press.

Huberty, C.J (1994). Applied discriminant analysis. New York: Wiley and Sons.

Huberty, C.J, & Barton, R. (1989). An introduction to discriminant analysis. Measurement and Evaluation in Counseling and Development, 22, 158-168.

Huberty, C.J, & Lowman, L.L. (1997). Discriminant analysis via statistical packages. Educational and Psychological Measurement, 57, 759-784.

Huberty, C.J, & Wisenbaker, J. (1992). Discriminant analysis: Potential improvements in typical practice. In B. Thompson (Ed.), Advances in social science methodology (Vol. 2, pp. 169-208). Greenwich, CT: JAI Press.

Huebner, E.S. (1991). Correlates of life satisfaction in children. School Psychology Quarterly, 6, 103-111.

Huebner, E.S. (1992). Burnout among school psychologists: An exploratory investigation into its nature, extent, and correlates. School Psychology Quarterly, 7, 129-136.

Humphries-Wadsworth, T.M. (1998, April). Features of published analyses of canonical results. Paper presented at the annual meeting of the American Educational Research Association, San Diego. (ERIC Document Reproduction Service No. ED forthcoming)

Hunter, J.E. (1997). Needed: A ban on the significance test. Psychological Science, 8(1), 3-7.

Jöreskog, K.G., & Sörbom, D. (1989). LISREL 7: A guide to the program and applications (2nd ed.). Chicago: SPSS.

Jorgenson, C.B., Jorgenson, D.E., Gillis, M.K., & McCall, C.M. (1993). Validation of a screening instrument for young children with teacher assessment of school performance. School Psychology Quarterly, 8, 125-139.

Kerlinger, F.N., & Pedhazur, E.J. (1973). Multiple regression in behavioral research. New York: Holt, Rinehart and Winston.

*Kirk, R. (1996). Practical significance: A concept whose time has come. Educational and Psychological Measurement, 56, 746-759.

Knapp, T.R. (1978). Canonical correlation analysis: A general parametric significance testing system. Psychological Bulletin, 85, 410-416.

Kupfersmid, J. (1988). Improving what is published: A model in search of an editor. American Psychologist, 43, 635-642.

*Lance, T., & Vacha-Haase, T. (1998, August). The Counseling Psychologist: Trends and usages of statistical significance testing. Paper presented at the annual meeting of the American Psychological Association, San Francisco.

Levin, J.R. (1998). To test or not to test H0? Educational and Psychological Measurement, 58, 311-331.

Levine, M.S. (1977). Canonical analysis and factor comparison. Newbury Park, CA: Sage.

Loftus, G.R. (1994, August). Why psychology will never be a real science until we change the way we analyze data. Paper presented at the annual meeting of the American Psychological Association, Los Angeles.

Lunneborg, C.E. (1987). Bootstrap applications for the behavioral sciences. Seattle: University of Washington.

Meehl, P.E. (1978). Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. Journal of Consulting and Clinical Psychology, 46, 806-834.

*Meier, S.T., & Davis, S.R. (1990). Trends in reporting psychometric properties of scales used in counseling psychology research. Journal of Counseling Psychology, 37, 113-115.

Meredith, W. (1964). Canonical correlations with fallible data. Psychometrika, 29, 55-65.

Morrison, D.E., & Henkel, R.E. (Eds.). (1970). The significance test controversy. Chicago: Aldine.

Murphy, K.R. (1997). Editorial. Journal of Applied Psychology, 82, 3-5.

*Nelson, N., Rosenthal, R., & Rosnow, R.L. (1986). Interpretation of significance levels and effect sizes by psychological researchers. American Psychologist, 41, 1299-1301.

*Ness, C., & Vacha-Haase, T. (1998, August). Statistical significance reporting: Current trends and usages within Professional Psychology: Research and Practice. Paper presented at the annual meeting of the American Psychological Association, San Francisco.

*Nillson, J., & Vacha-Haase, T. (1998, August). A review of statistical significance reporting in the Journal of Counseling Psychology. Paper presented at the annual meeting of the American Psychological Association, San Francisco.

*Oakes, M. (1986). Statistical inference: A commentary for the social and behavioral sciences. New York: Wiley.

*Olejnik, S.F. (1984). Planning educational research: Determining the necessary sample size. Journal of Experimental Education, 53, 40-48.

Pedhazur, E.J. (1982). Multiple regression in behavioral research: Explanation and prediction (2nd ed.). New York: Holt, Rinehart and Winston.

Pedhazur, E.J., & Schmelkin, L.P. (1991). Measurement, design, and analysis: An integrated approach. Hillsdale, NJ: Erlbaum.

*Reetz, D., & Vacha-Haase, T. (1998, August). Trends and usages of statistical significance testing in adult development and aging research: A review of Psychology and Aging. Paper presented at the annual meeting of the American Psychological Association, San Francisco.

Reinhardt, B. (1996). Factors affecting coefficient alpha: A mini Monte Carlo study. In B. Thompson (Ed.), Advances in social science methodology (Vol. 4, pp. 3-20). Greenwich, CT: JAI Press.

Robinson, D., & Levin, J. (1997). Reflections on statistical and substantive significance, with a slice of replication. Educational Researcher, 26(5), 21-26.

Rosenthal, R. (1979). The "file drawer problem" and tolerance for null results. Psychological Bulletin, 86, 638-641.

Rosenthal, R. (1991). Effect sizes: Pearson's correlation, its display via the BESD, and alternative indices. American Psychologist, 46, 1086-1087.

*Rosenthal, R., & Gaito, J. (1963). The interpretation of level of significance by psychological researchers. Journal of Psychology, 55, 33-38.

Rosnow, R.L., & Rosenthal, R. (1989). Statistical procedures and the justification of knowledge in psychological science. American Psychologist, 44, 1276-1284.

Rowley, G.L. (1976). The reliability of observational measures. American Educational Research Journal, 13, 51-59.

Rozeboom, W.W. (1960). The fallacy of the null hypothesis significance test. Psychological Bulletin, 57, 416-428.

Rozeboom, W.W. (1997). Good science is abductive, not hypothetico-deductive. In L.L. Harlow, S.A. Mulaik & J.H. Steiger (Eds.), What if there were no significance tests? (pp. 335-392). Mahwah, NJ: Erlbaum.

Schmidt, F. (1996). Statistical significance testing and cumulative knowledge in psychology: Implications for the training of researchers. Psychological Methods, 1(2), 115-129.

Schmidt, F.L., & Hunter, J.E. (1997). Eight common but false objections to the discontinuation of significance testing in the analysis of research data. In L.L. Harlow, S.A. Mulaik & J.H. Steiger (Eds.), What if there were no significance tests? (pp. 37-64). Mahwah, NJ: Erlbaum.

Schneider, A.L., & Darcy, R.E. (1984). Policy implications of using significance tests in evaluation research. Evaluation Review, 8, 573-582.

Shaver, J. (1985). Chance and nonsense. Phi Delta Kappan, 67(1), 57-60.

Shaver, J. (1993). What statistical significance testing is, and what it is not. Journal of Experimental Education, 61, 293-316.

Shea, C. (1996). Psychologists debate accuracy of "significance test." Chronicle of Higher Education, 42(49), A12, A16.

Snyder, P. (1991). Three reasons why stepwise regression methods should not be used by researchers. In B. Thompson (Ed.), Advances in educational research: Substantive findings, methodological developments (Vol. 1, pp. 99-105). Greenwich, CT: JAI Press.

Snyder, P., & Lawson, S. (1993). Evaluating results using corrected and uncorrected effect size estimates. Journal of Experimental Education, 61, 334-349.

Snyder, P.A., & Thompson, B. (in press). Use of tests of statistical significance and other analytic choices in a school psychology journal: Review of practices and suggested alternatives. School Psychology Quarterly.

Thompson, B. (1984). Canonical correlation analysis: Uses and interpretation. Newbury Park, CA: Sage.

Thompson, B. (1985). Alternate methods for analyzing data from experiments. Journal of Experimental Education, 54, 50-55.

Thompson, B. (1988a, November). Common methodology mistakes in dissertations: Improving dissertation quality. Paper presented at the annual meeting of the Mid-South Educational Research Association, Louisville, KY. (ERIC Document Reproduction Service No. ED 301 595)

Thompson, B. (1988b). Program FACSTRAP: A program that computes bootstrap estimates of factor structure. Educational and Psychological Measurement, 48, 681-686.

Thompson, B. (1989). Why won't stepwise methods die? Measurement and Evaluation in Counseling and Development, 21(4), 146-148.

Thompson, B. (1990). ALPHAMAX: A program that maximizes coefficient alpha by selective item deletion. Educational and Psychological Measurement, 50, 585-589.

Thompson, B. (1991). A primer on the logic and use of canonical correlation analysis. Measurement and Evaluation in Counseling and Development, 24(2), 80-95.

Thompson, B. (1992a). DISCSTRA: A computer program that computes bootstrap resampling estimates of descriptive discriminant analysis function and structure coefficients and group centroids. Educational and Psychological Measurement, 52, 905-911.

Thompson, B. (1992b). Misuse of ANCOVA and related "statistical control" procedures. Reading Psychology, 13, iii-xviii.

Thompson, B. (1992c). Two and one-half decades of leadership in measurement and evaluation. Journal of Counseling and Development, 70, 434-438.

Thompson, B. (1993). The use of statistical significance tests in research: Bootstrap and other alternatives. Journal of Experimental Education, 61, 361-377.

Thompson, B. (1994a, April). Common methodology mistakes in dissertations, revisited. Paper presented at the annual meeting of the American Educational Research Association, New Orleans. (ERIC Document Reproduction Service No. ED 368 771)

Thompson, B. (1994b). The concept of statistical significance testing (An ERIC/AE Clearinghouse Digest #EDO-TM-94-1). Measurement Update, 4(1), 5-6. (ERIC Document Reproduction Service No. ED 366 654)

Thompson, B. (1994c). Guidelines for authors. Educational and Psychological Measurement, 54(4), 837-847.

Thompson, B. (1994d). The pivotal role of replication in psychological research: Empirically evaluating the replicability of sample results. Journal of Personality, 62, 157-176.

Thompson, B. (1994e, February). Why multivariate methods are usually vital in research: Some basic concepts. Paper presented as a Featured Speaker at the biennial meeting of the Southwestern Society for Research in Human Development, Austin, TX. (ERIC Document Reproduction Service No. ED 367 687)

Thompson, B. (1995a). Exploring the replicability of a study's results: Bootstrap statistics for the multivariate case. Educational and Psychological Measurement, 55, 84-94.

Thompson, B. (1995b). Review of Applied discriminant analysis by C.J. Huberty. Educational and Psychological Measurement, 55, 340-350.

Thompson, B. (1995c). Stepwise regression and stepwise discriminant analysis need not apply here: A guidelines editorial. Educational and Psychological Measurement, 55, 525-534.

Thompson, B. (1996). AERA editorial policies regarding statistical significance testing: Three suggested reforms. Educational Researcher, 25(2), 26-30.

Thompson, B. (1997a). Editorial policies regarding statistical significance tests: Further comments. Educational Researcher, 26(5), 29-32.

Thompson, B. (1997b). The importance of structure coefficients in structural equation modeling confirmatory factor analysis. Educational and Psychological Measurement, 57, 5-19.

Thompson, B. (1998). Review of What if there were no significance tests? by L. Harlow, S. Mulaik & J. Steiger (Eds.). Educational and Psychological Measurement, 58, 332-344.

Thompson, B. (in press-a). Canonical correlation analysis. In L. Grimm & P. Yarnold (Eds.), Reading and understanding multivariate statistics (Vol. 2). Washington, DC: American Psychological Association.

Thompson, B. (in press-b). If statistical significance tests are broken/misused, what practices should supplement or replace them? Theory & Psychology.

Thompson, B. (in press-c). In praise of brilliance, where that praise really belongs. American Psychologist.

Thompson, B. (in press-d). Why "encouraging" effect size reporting isn't working: The etiology of researcher resistance to changing practices. Journal of Psychology.

Thompson, B., & Borrello, G. M. (1985). The importance of structure coefficients in regression research. Educational and Psychological Measurement, 45, 203-209.

Thompson, B., & Daniel, L.G. (1996a). Factor analytic evidence for the construct validity of scores: An historical overview and some guidelines. Educational and Psychological Measurement, 56, 213-224.

Thompson, B., & Daniel, L.G. (1996b). Seminal readings on reliability and validity: A "hit parade" bibliography. Educational and Psychological Measurement, 56, 741-745.

*Thompson, B., & Snyder, P.A. (1997). Statistical significance testing practices in the Journal of Experimental Education. Journal of Experimental Education, 66, 75-83.

*Thompson, B., & Snyder, P.A. (in press). Statistical significance and reliability analyses in recent JCD research articles. Journal of Counseling and Development.

Travers, R.M.W. (1983). How research has changed American schools: A history from 1840 to the present. Kalamazoo, MI: Mythos Press.

Tuckman, B.W. (1990). A proposal for improving the quality of published educational research. Educational Researcher, 19(9), 22-24.

Vacha-Haase, T. (1998a). Reliability generalization: Exploring variance in measurement error affecting score reliability across studies. Educational and Psychological Measurement, 58, 6-20.

Vacha-Haase, T. (1998b, August). A review of APA journals' editorial policies regarding statistical significance testing and effect size. Paper presented at the annual meeting of the American Psychological Association, San Francisco.

*Vacha-Haase, T., & Nilsson, J.E. (in press). Statistical significance reporting: Current trends and usages within MECD. Measurement and Evaluation in Counseling and Development.

Vacha-Haase, T., & Thompson, B. (in press). Further comments on statistical significance tests. Measurement and Evaluation in Counseling and Development.

*Vockell, E.L., & Asher, W. (1974). Perceptions of document quality and use by educational decision makers and researchers. American Educational Research Journal, 11, 249-258.

*Wandt, E. (1967). An evaluation of educational research published in journals (Report of the Committee on Evaluation of Research). Washington, DC: American Educational Research Association.

*Ward, A.W., Hall, B.W., & Schramm, C.E. (1975). Evaluation of published educational research: A national survey. American Educational Research Journal, 12, 109-128.

*Willson, V.L. (1980). Research techniques in AERJ articles: 1969 to 1978. Educational Researcher, 9(6), 5-10.

*Zuckerman, M., Hodgins, H.S., Zuckerman, A., & Rosenthal, R. (1993). Contemporary issues in the analysis of data: A survey of 551 psychologists. Psychological Science, 4, 49-53.

Zwick, R. (1997, March). Would the abolition of significance testing lead to better science? Paper presented at the annual meeting of the American Educational Research Association, Chicago.


Table 1
Correlation Coefficients for Selected
Holzinger and Swineford (1939) Data Used to Illustrate That SEM is the Most General Case of the General Linear Model

       T6       T7       T2       T4       T20      T21      T22
T6    1.0000    .7332 |  .1529    .1586    .3440    .3206    .4476
T7     .7332   1.0000 |  .1394    .0772    .3367    .3020    .4698
T2     .1529    .1394 | 1.0000    .3398    .2812    .2433    .2812
T4     .1586    .0772 |  .3398   1.0000    .3243    .3310    .3062
T20    .3440    .3367 |  .2812    .3243   1.0000    .3899    .3947
T21    .3206    .3020 |  .2433    .3310    .3899   1.0000    .3767
T22    .4476    .4698 |  .2812    .3062    .3947    .3767   1.0000

Note. The variable labels for these seven variables are:
  T6 PARAGRAPH COMPREHENSION TEST
  T7 SENTENCE COMPLETION TEST
  T2 CUBES, SIMPLIFICATION OF BRIGHAM'S SPATIAL RELATIONS TEST
  T4 LOZENGES FROM THORNDIKE--SHAPES FLIPPED OVER THEN IDENTIFY TARGET
  T20 DEDUCTIVE MATH ABILITY
  T21 MATH NUMBER PUZZLES
  T22 MATH WORD PROBLEM REASONING

Table 2
Standardized Canonical Function Coefficients for the Table 1 Data
Derived Using the Appendix A SPSS/LISREL Program to Illustrate That
SEM is the Most General Case of the General Linear Model

Standardized canonical coefficients for DEPENDENT variables

Variable                  1                2
T6                   .44962         -1.40007
T7                   .62246          1.33225


Standardized canonical coefficients for COVARIATES

COVARIATE                 1                2
 T2                  -.01468           .06704
 T4                  -.20012         -1.00653
 T20                  .34100          -.02762
 T21                  .26772          -.17401
 T22                  .73104           .35974
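
A reader who wishes to check the Table 2 coefficients directly from the Table 1 correlations might use a sketch along the following lines. It is not the Appendix A SPSS/LISREL program; it is a minimal NumPy illustration that assumes T6 and T7 form the dependent set and T2, T4, T20, T21, and T22 form the covariate set, and that scales each canonical variate to unit variance. The names `inv_sqrt`, `A`, `B`, and `rc` are illustrative only. Because the sign of any function is arbitrary, the output should be compared with Table 2 only up to reflection and rounding.

```python
import numpy as np

# Table 1 correlations, ordered T6, T7 | T2, T4, T20, T21, T22.
R = np.array([
    [1.0000, .7332, .1529, .1586, .3440, .3206, .4476],
    [.7332, 1.0000, .1394, .0772, .3367, .3020, .4698],
    [.1529, .1394, 1.0000, .3398, .2812, .2433, .2812],
    [.1586, .0772, .3398, 1.0000, .3243, .3310, .3062],
    [.3440, .3367, .2812, .3243, 1.0000, .3899, .3947],
    [.3206, .3020, .2433, .3310, .3899, 1.0000, .3767],
    [.4476, .4698, .2812, .3062, .3947, .3767, 1.0000],
])
Ryy, Rxx, Ryx = R[:2, :2], R[2:, 2:], R[:2, 2:]


def inv_sqrt(M):
    """Symmetric inverse square root of a positive definite matrix."""
    vals, vecs = np.linalg.eigh(M)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T


Ryy_is, Rxx_is = inv_sqrt(Ryy), inv_sqrt(Rxx)

# The singular values of the "whitened" cross-correlation matrix are the
# canonical correlations; the singular vectors yield the function coefficients.
U, rc, Vt = np.linalg.svd(Ryy_is @ Ryx @ Rxx_is, full_matrices=False)
A = Ryy_is @ U     # rows T6, T7; columns are functions 1 and 2
B = Rxx_is @ Vt.T  # rows T2, T4, T20, T21, T22; columns are functions 1 and 2

np.set_printoptions(precision=5, suppress=True)
print("canonical correlations:", rc)
print("dependent-set coefficients:\n", A)
print("covariate-set coefficients:\n", B)
```

The unit-variance scaling used here can be confirmed by checking that `A.T @ Ryy @ A` and `B.T @ Rxx @ B` are both identity matrices.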

Table 3
LISREL "Gamma" Coefficients for the Table 1 Data
Derived Using the Appendix A SPSS/LISREL Program to Illustrate That
SEM is the Most General Case of the General Linear Model

GAMMA
              T6         T7
ETA 1    0.44957    0.62250

GAMMA
              T2         T4        T20        T21        T22
ETA 1   -0.01468   -0.20014    0.34100    0.26772    0.73104

GAMMA
              T6         T7
ETA 1    0.44956    0.62251
ETA 2    1.40013   -1.33228

GAMMA
              T2         T4        T20        T21        T22
ETA 1   -0.01469   -0.20014    0.34101    0.26771    0.73104
ETA 2   -0.06706    1.00653    0.02762    0.17402   -0.35972

Note. The LISREL coefficients for the "gamma" matrix exactly match (within rounding error) the canonical function coefficients presented previously. The only exception is that all the signs for the SEM second canonical function coefficients must be "reflected." "Reflecting" a function (changing all the signs on a given function, factor, or equation) is always permissible, because the scaling of psychological constructs is arbitrary. Thus, the SEM and the canonical analysis derived the same results.
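
The reflection described in this note can be verified numerically. The following is a small sketch, not part of the original analysis, that takes the Table 2 coefficients and the two-function "gamma" coefficients from Table 3, changes every sign on the second SEM function, and reports the largest remaining discrepancy. The array names `canonical`, `gamma`, and `reflected` are illustrative.

```python
import numpy as np

# Table 2: rows are variables, columns are canonical functions 1 and 2.
canonical = np.array([
    [ .44962, -1.40007],  # T6
    [ .62246,  1.33225],  # T7
    [-.01468,   .06704],  # T2
    [-.20012, -1.00653],  # T4
    [ .34100,  -.02762],  # T20
    [ .26772,  -.17401],  # T21
    [ .73104,   .35974],  # T22
])

# Table 3, two-function solution: same row order, columns ETA 1 and ETA 2.
gamma = np.array([
    [ .44956,  1.40013],  # T6
    [ .62251, -1.33228],  # T7
    [-.01469,  -.06706],  # T2
    [-.20014,  1.00653],  # T4
    [ .34101,   .02762],  # T20
    [ .26771,   .17402],  # T21
    [ .73104,  -.35972],  # T22
])

# "Reflect" the second function: change all of its signs, leave function 1 alone.
reflected = gamma * np.array([1.0, -1.0])
max_diff = np.abs(reflected - canonical).max()
print(f"largest absolute difference after reflection: {max_diff:.5f}")
```

Without the reflection step the second columns disagree in sign on every variable, which is the only substantive difference between the two sets of results.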

Table 4
The Confusing Language of Statistics
(Intentionally Designed to Confuse the Graduate Students)

                                             Synthetic/
                Standardized    Weight       Latent
Analysis        Weights(a)      System       Variable(s)

Multiple
Regression      beta            "equation"   Yhat

Factor          pattern         "factor"     factor
Analysis        coefficients                 scores

Descriptive     standardized    "function"   discriminant
Discriminant    function           -or-      function
Analysis        coefficients    "rule"       scores

Canonical       standardized                 canonical
Correlation     function        "function"   function
Analysis        coefficients                 scores

(a) Of course, the term "standardized weight" is an obvious oxymoron. A given weight is a constant applied to all the scores of all the cases/people on the observed/manifest/measured variable, and therefore cannot be standardized. Instead, the weighting constant is applied to the measured variable in its standardized form, i.e., we should say "weight for the standardized measured variables" rather than "standardized weight".
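
The point of this footnote can be made concrete for the simplest regression case. The sketch below uses a handful of hypothetical scores, invented purely for illustration and not drawn from the Holzinger and Swineford data, to show that the so-called "standardized weight" is simply the weight obtained when the measured variables are first converted to z-score form. The helper `z` and the variable names are likewise illustrative.

```python
import numpy as np

# Hypothetical scores for five cases (illustration only, not real data).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([12.0, 10.0, 19.0, 15.0, 24.0])


def z(v):
    """Convert measured scores to standardized (z-score) form."""
    return (v - v.mean()) / v.std(ddof=1)


b = np.polyfit(x, y, 1)[0]           # weight applied to the raw scores
beta = np.polyfit(z(x), z(y), 1)[0]  # weight applied to the z-scores

# The same value follows from rescaling the raw-score weight.
rescaled = b * x.std(ddof=1) / y.std(ddof=1)

print(f"weight for raw scores:          {b:.4f}")
print(f"weight for standardized scores: {beta:.4f}")
print(f"raw weight rescaled by SDs:     {rescaled:.4f}")
```

In both runs the weight is a single constant; what is standardized is the variable to which it is applied.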

Table 5
Holzinger and Swineford Data to Show
That More Predictors May Actually Hurt Classification Accuracy

                    Seq  ID GRADE T13 T17 T22 T16
                      1   2   7   285  12  21 100
                      2   3   7   159   1  18  95
                      3   9   7   265  18  18 105
                      4  14   7   211   8  22 103
                      5  16   7   211   5  34 102
                      6  18   7   189  13  16 100
                      7  20   7   207   3  47 107
                      8  22   7   194   8  19  96
                      9  25   7   244   6  20  99
                     10  28   7   163  12  24 106
                     11  30   7   310  10  20 101
                     12  34   7   121   3  18  92
                     13  44   7   167  11  22 112
                     14  46   7   100   4  25  58
                     15  47   7   240   6  20 103
                     16  50   7   226   4  39 109
                     17  51   7   196   8  18  96
                     18  52   7   218   7  18  92
                     19  58   7   151  15  25 102
                     20  66   7   142   3  13  95
                     21  68   7   172  10  32 110
                     22  71   7   181   9  27 107
                     23  74   7   153  15  21  99
                     24  75   7   141  14  19 107
                     25  76   7   195  10  19 103
                     26  78   7   186   7  30 109
                     27  79   7   215  10  15 103
                     28  81   7   165  11  22 108
                     29  83   7   233   2  28 100
                     30  85   7   203   8  24 103
                     31 202   7   195   9  22 106
                     32 203   7   228   1  43 101
                     33 205   7   160   9  35  99
                     34 208   7   333  16  45 118
                     35 213   7   154   3  19 106
                     36 225   7   236  21  29 116
                     37 226   7   219   6  23 104
                     38 230   7   189   1   7  99
                     39 232   7   143   2  27  94
                     40 235   7   162   3  16 100
                     41 236   7   205   6  27 101
                     42 239   7   112   3  18  90
                     43 244   7   137   0  24 105
                     44 245   7   214   4  26 100
                     45 250   7   120   3  28 112
                     46 252   7   165   1  10 101
                     47 253   7   137   1  15  89
                     48 256   7   214   4  28  97
                     49 257   7   223   5  23 106
                     50 263   7   205   5  35 103
                     51 264   7   180   6  36  97
                     52 268   7   130   3  14 103
                     53 269   7   220   4  31 113
                     54 277   7   149   1  21  96
                     55  86   8   207  19  37 112
                     56  88   8   217  24  20 106
                     57  89   8   191  10  27 109
                     58  90   8   208   9  17  98
                     59 106   8   260  17  41 104
                     60 112   8   148  11  34 105
                     61 118   8   271  11  34 113
                     62 120   8   175  10  24 111
                     63 126   8   180  11  21  96
                     64 131   8   247  20  26 101
                     65 132   8   119   2  28  91
                     66 133   8   234  14  44 113
                     67 134   8   172  23  26  99
                     68 137   8   177  11  25  93
                     69 139   8   208  18  34 107
                     70 140   8   227   9  13 108
                     71 143   8   259  16  23 107
                     72 148   8   196   7  39  96
                     73 150   8   248  17  32 110
                     74 151   8   255  26  34 112
                     75 153   8   206  11  16 105
                     76 155   8   238  16  49 102
                     77 158   8   227  18  15 101
                     78 160   8   197   6  25 100
                     79 165   8   195   9  29  91
                     80 282   8   241   1  27 115
                     81 283   8   230   4  26 103
                     82 284   8   200  11   8 108
                     83 285   8   246  16  33 109
                     84 287   8   227  11  48 109
                     85 288   8   168  11  28 104
                     86 289   8   224  13  43 104
                     87 290   8   189   7  38 110
                     88 297   8   199   8  30 108
                     89 298   8   249  15  50 119
                     90 299   8   212   7  29 102
                     91 304   8   210   5  27 104
                     92 311   8   198   7  34 107
                     93 312   8   237   6  18 108
                     94 313   8   206  15  50 107
                     95 315   8   215   5  27 101
                     96 317   8   183   9  18 113
                     97 318   8   187   8  35 109
                     98 322   8   220   7  26 109
                     99 323   8   178   8  27 103
                    100 324   8   150   6   8 102
                    101 329   8   235   6  18 101
                    102 338   8   206  26  37 113
                    103 341   8   174   7  46 105
                    104 342   8   162   9  29  96
                    105 343   8   228   1  39 104
                    106 345   8   204   7  25 112
                    107 351   8   186  25  39 109

Note.  The variable labels are:
                T13 SPEEDED DISCRIM STRAIGHT AND CURVED CAPS
                T17 MEMORY OF OBJECT-NUMBER ASSOCIATION TARGETS
                T22 MATH WORD PROBLEM REASONING
                T16 MEMORY OF TARGET SHAPES

Table 6
Holzinger and Swineford Results to Show That
More Predictors May Actually Hurt Classification Accuracy
--LDF and LCF Score Classification Tables--

GRADE  by  LDFCL3  LDF classification  3 predictors
            Count  I
                   I
                   I                Row
                   I     7I     8I Total
GRADE      --------+------+------+
                7  I    40I    14I    54
                   I      I      I  50.5
                   +------+------+
                8  I    22I    31I    53
                   I      I      I  49.5
                   +------+------+
            Column      62     45    107
             Total    57.9   42.1  100.0



GRADE  by  LDFCL4  LDF classification  4 predictors
            Count  I
                   I
                   I                Row
                   I     7I     8I Total
GRADE      --------+------+------+
                7  I    38I    16I    54
                   I      I      I  50.5
                   +------+------+
                8  I    23I    30I    53
                   I      I      I  49.5
                   +------+------+
            Column      61     46    107
             Total    57.0   43.0  100.0
 

GRADE  by  LCFCL3  LCF classification  3 predictors
            Count  I
                   I
                   I                Row
                   I     7I     8I Total
GRADE      --------+------+------+
                7  I    40I    14I    54
                   I      I      I  50.5
                   +------+------+
                8  I    22I    31I    53
                   I      I      I  49.5
                   +------+------+
            Column      62     45    107
             Total    57.9   42.1  100.0



GRADE  by  LCFCL4  LCF classification  4 predictors
            Count  I
                   I
                   I                Row
                   I     7I     8I Total
GRADE      --------+------+------+
                7  I    38I    16I    54
                   I      I      I  50.5
                   +------+------+
                8  I    23I    30I    53
                   I      I      I  49.5
                   +------+------+
            Column      61     46    107
             Total    57.0   43.0  100.0
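
The hit rates implied by the four classification tables above are not printed, but they follow directly from the diagonal counts. As a small sketch (the dictionary labels and the `hit_rate` helper are illustrative), the arithmetic is:

```python
# Table 6 counts: rows are actual grade (7, 8), columns are predicted grade (7, 8).
tables = {
    "LDF, 3 predictors": [[40, 14], [22, 31]],
    "LDF, 4 predictors": [[38, 16], [23, 30]],
    "LCF, 3 predictors": [[40, 14], [22, 31]],
    "LCF, 4 predictors": [[38, 16], [23, 30]],
}


def hit_rate(table):
    """Proportion of cases on the diagonal, i.e., correctly classified."""
    hits = table[0][0] + table[1][1]
    total = sum(sum(row) for row in table)
    return hits / total


rates = {name: hit_rate(t) for name, t in tables.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.1%}")
```

With three predictors, 71 of the 107 cases (about 66.4%) are classified correctly; with four predictors, only 68 of 107 (about 63.6%) are, under both the LDF and the LCF rules.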

Table 7
Holzinger and Swineford Results to Show That
More Predictors May Actually Hurt Classification Accuracy
--Both LDF and LCF Actual Classifications--

        Seq  ID GRADE LDFCL3 LDFCL4 LCFCL3 LCFCL4
          1   2   7      8      8      8      8
          2   3   7      7      7      7      7
          3   9   7      8      8      8      8
          4  14   7      7      7      7      7
          5  16   7      7      7      7      7
          6  18   7      7      7      7      7
          7  20   7      8      8      8      8
          8  22   7      7      7      7      7
          9  25   7      7      7      7      7
         10  28   7      8      8      8      8
         11  30   7  +   8      7      8      7
         12  34   7      7      7      7      7
         13  44   7  -   7      8      7      8
         14  46   7      7      7      7      7
         15  47   7      7      7      7      7
         16  50   7      8      8      8      8
         17  51   7      7      7      7      7
         18  52   7      7      7      7      7
         19  58   7      8      8      8      8
         20  66   7      7      7      7      7
         21  68   7      8      8      8      8
         22  71   7  -   7      8      7      8
         23  74   7      8      8      8      8
         24  75   7      8      8      8      8
         25  76   7      7      7      7      7
         26  78   7  -   7      8      7      8
         27  79   7      7      7      7      7
         28  81   7  -   7      8      7      8
         29  83   7      7      7