The 7-Step and 9-Step Versions of Hypothesis Testing
Dear Students,
In an attempt to help you with the 7-step and 9-step
versions of hypothesis testing, I'd like to do two things right now in
this email message. First, I'm going to type out a paragraph of material
that appears in the 4-page journal article that I gave you at the end
of class yesterday. Then, I'll try to draw a connection between the quoted
paragraph and the two versions of hypothesis testing discussed in Ch. 8.
Here's the paragraph from the Olds/Abernethy article,
"Post-Exercise Oxygen Consumption Following Heavy and Light Resistance
Exercise":
Statistical Analyses
Standard descriptive statistics were used in all analyses.
One-way analyses of variance (ANOVA) was used to compare ROC following
the three conditions. Fisher's PLSD test was used in post hoc analyses.
Effect sizes were calculated between the resting and light, and [between]
the resting and heavy conditions. The effect size is the difference between
the mean values of two samples divided by the standard deviation. Effect
sizes are often used to give an idea of the magnitude of the difference
between groups, and to circumvent difficulties arising when the sample
size is small (failure to reach statistical significance) or large (where
small differences may be statistically, but not practically, significant).
Effect sizes of about 0.2, 0.5, and 0.8 are often categorized as small,
moderate, and large, respectively.
Ok. That's the paragraph from the Olds/Abernethy article.
And here come a few comments from Sky:
* Don't worry AT ALL about the 2nd or 3rd sentences; they are
NOT germane to the material in Chapter 8.
* The most important part of this passage is contained in the next-to-last
sentence. The exceedingly important point made by the authors in this
sentence is as follows:
- In hypothesis testing, a sample size that's too
small is likely to produce a fail-to-reject decision . . . even if
there's a large and true difference between the means of the populations
associated with the two samples being compared. In other words,
a sample size that's too small will tend to bring forth a Type II
error, thus causing the researcher to "miss" detecting
something important that occurred.
- In hypothesis testing, a sample size that's too
large is likely to produce a reject decision . . . even if there's
a small and trivial difference between the means of the populations
associated with the two samples being compared. In other words,
a sample size that's too big can create a situation wherein a "significant
difference" is shown to exist between two sample means . . .
when, in fact, the difference between the sample means is tiny
and "significant" only in a statistical sense and NOT
AT ALL in a practical sense!
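Those two bullet points can be seen numerically. Here's a small sketch of my own (the means, SDs, and sample sizes are made-up numbers, not anything from the article) that computes the two-sided p-value of a simple two-sample z-test and shows how the verdict flips with sample size while the difference itself stays the same:

```python
from math import sqrt
from statistics import NormalDist

def two_sample_z_pvalue(mean_diff, sd, n_per_group):
    """Two-sided p-value for a two-sample z-test, equal n and equal SDs."""
    se = sd * sqrt(2.0 / n_per_group)      # standard error of the difference
    z = abs(mean_diff) / se
    return 2.0 * (1.0 - NormalDist().cdf(z))

# A tiny, practically trivial difference (0.1 SD units) ...
# ... is NOT "significant" with a modest sample,
print(two_sample_z_pvalue(mean_diff=0.5, sd=5.0, n_per_group=20))    # p ~ 0.75
# ... but IS "significant" with a huge sample (the "too large" case).
print(two_sample_z_pvalue(mean_diff=0.5, sd=5.0, n_per_group=2000))  # p ~ 0.0016
# Conversely, a big, real difference (a full SD unit) gets "missed"
# when n is very small (the "too small" / Type II error case).
print(two_sample_z_pvalue(mean_diff=5.0, sd=5.0, n_per_group=4))     # p ~ 0.16
```

Same 0.5-unit difference both times; only n changed.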
* The concept of "effect size" is at
the heart of the 9-step version of hypothesis testing . . . and the
ES is simply a judgment by the researcher as to the line of demarcation
between "small" (i.e., trivial) differences, on the one hand,
and "large" (i.e., important) differences, on the other. The
9-step version of hypothesis testing requires a researcher to specify
his/her opinion as to ES . . . and then the appropriate sample size
is determined (via a formula or a chart) so the researcher will have
a decent chance of reaching a "reject" decision if there is
a bigger-than-ES difference between the study's population means. What's
a "decent chance" is usually considered to be about 80%, and
this number, when converted from a percentage to a decimal, is called
the test's "power." In summary, the 9-step version of hypothesis
testing allows a researcher to design his/her study so the sample size(s)
won't be too small or too large; this is done by (a) asking the researcher
to specify ES, (b) asking the researcher to specify power, and (c) asking
the researcher to determine (via formula or chart) the "proper"
number of subjects to have in the study's sample(s). These three steps are
positioned as Steps #4, #5, and #6 of the 9-step version of hypothesis
testing.
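If you're curious what "via a formula or a chart" looks like, here is a rough sketch of Steps #4-#6 using the standard normal-approximation formula for a two-group, two-sided comparison. This is my own illustration, not something from Chapter 8 or the article, and the normal approximation gives slightly smaller n's than exact t-based tables:

```python
from math import ceil
from statistics import NormalDist

def n_per_group(es, power=0.80, alpha=0.05):
    """Per-group sample size for a two-sided, two-group comparison
    (normal approximation).

    es    -- the researcher's judged effect size in SD units (Step #4)
    power -- the desired power; 0.80 is the usual "decent chance" (Step #5)
    """
    z = NormalDist().inv_cdf
    z_alpha = z(1.0 - alpha / 2.0)   # critical value for the two-sided test
    z_beta = z(power)                # z-value delivering the desired power
    return ceil(2.0 * ((z_alpha + z_beta) / es) ** 2)   # Step #6

print(n_per_group(es=0.5))   # a "moderate" ES needs about 63 per group
print(n_per_group(es=0.2))   # a "small" ES needs far more: about 393 per group
```

Notice how demanding a small ES is on sample size; that's exactly why the ES judgment has to come before the data are collected.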
* In the 7-step version of hypothesis testing,
the researcher simply adds a step after rejecting or failing-to-reject
the null hypothesis with the 6-step procedure (wherein there's no pre-planned
ES, power specification, and sample size determination). The 7th step
involves using the sample data to either (1) show that sample size was
not "too small" or "too large" or (2) look at the
sample means and evaluate the observed difference as being either small-and-trivial
or big-and-important. The first of these options asks the researcher
to specify ES (as in the 9-step version of hypothesis testing) and then
determine, via formula or chart, how much power there was in light of
the sample sizes that were used. The second of these options allows
the researcher to compute a strength-of-association index, or to estimate
the observed magnitude of effect.
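The first of those two options can be sketched with the same normal approximation as before. Again, this is my own illustration with made-up numbers, not the procedure from the chapter: given the ES the researcher specifies and the per-group n that was actually used, compute (approximately) how much power the test had:

```python
from math import sqrt
from statistics import NormalDist

def posthoc_power(es, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group test to detect a
    specified ES, given the per-group sample size actually used.
    (Normal approximation; the far tail is ignored, as is standard.)"""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1.0 - alpha / 2.0)
    # How many standard errors apart the sample means would be expected
    # to sit if the true difference equaled the specified ES:
    ncp = es * sqrt(n_per_group / 2.0)
    return 1.0 - nd.cdf(z_alpha - ncp)

# With only 8 subjects per group, even a "large" ES of 0.8 gives weak
# power, so a fail-to-reject verdict says very little: n was "too small."
print(round(posthoc_power(es=0.8, n_per_group=8), 2))   # ~ 0.36
```

A power near 0.36 means the study would miss a genuinely large difference almost two times out of three.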
* In the Olds/Abernethy study, the 7-step version
of hypothesis testing was used. After rejecting or failing to reject
any null hypothesis, the researchers computed the "effect size"
based upon the sample data. They deserve high marks for going past the
6 steps that form the most basic version of hypothesis testing. What
they did in the 7th step, using my terminology, was to compute an estimate
of the observed magnitude of effect. I prefer my longer term to their
short phrase "effect size," because they computed their "effect
sizes" from the data and are, in a very real sense, telling us
"what they got" (rather than using the term "effect size"
to tell us their OPINION as to the line of demarcation between small
and large differences). Olds and Abernethy do, however, tie their computed
"effect sizes" back to some judgmental "benchmarks,"
for they make reference to Cohen's standardized numerical values of
0.2 (a "small" difference), 0.5 (a "medium" difference),
and 0.8 (a "large" difference).

Sky Huck
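P.S. For anyone who wants to see the arithmetic, here is a minimal sketch of what computing an observed magnitude of effect looks like. The two samples below are made-up numbers, NOT data from the article, and dividing by the first sample's SD is just one common choice (pooled SDs are another):

```python
from statistics import mean, stdev

def observed_effect_size(sample_a, sample_b):
    """Difference between the two sample means, divided by a standard
    deviation (here, the SD of the first sample)."""
    return abs(mean(sample_a) - mean(sample_b)) / stdev(sample_a)

def cohen_label(es):
    """Cohen's rough benchmarks: 0.2 small, 0.5 medium, 0.8 large."""
    if es < 0.2:
        return "trivial"
    if es < 0.5:
        return "small"
    if es < 0.8:
        return "medium"
    return "large"

resting = [3.5, 3.8, 3.6, 3.9, 3.7]   # hypothetical resting-condition values
heavy = [4.1, 4.4, 4.0, 4.5, 4.2]     # hypothetical heavy-condition values
es = observed_effect_size(resting, heavy)
print(round(es, 2), cohen_label(es))
```

Note that the computed number tells us "what they got"; the 0.2/0.5/0.8 labels are where the judgment comes back in.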
