


What Is The Purpose Of Finding The Standardized Effect Size Of A Study


J Grad Med Educ. 2012 Sep; 4(3): 279–282.

Using Effect Size—or Why the P Value Is Not Enough

Statistical significance is the least interesting thing about the results. You should describe the results in terms of measures of magnitude – not just, does a treatment affect people, but how much does it affect them.

-Gene V. Glass 1

The primary product of a research inquiry is one or more measures of effect size, not P values.

-Jacob Cohen 2

These statements about the importance of effect sizes were made by two of the most influential statistician-researchers of the past half-century. Yet many submissions to Journal of Graduate Medical Education omit mention of the effect size in quantitative studies while prominently displaying the P value. In this paper, we target readers with little or no statistical background in order to encourage you to improve your comprehension of the relevance of effect size for planning, analyzing, reporting, and understanding education research studies.

What Is Effect Size?

In medical education research studies that compare different educational interventions, effect size is the magnitude of the difference between groups. The absolute effect size is the difference between the average, or mean, outcomes in two different intervention groups. For example, if an educational intervention resulted in the improvement of subjects' examination scores by an average total of 15 of 50 questions as compared to that of another intervention, the absolute effect size is 15 questions or 3 grade levels (30%) better on the examination. Absolute effect size does not take into account the variability in scores, in that not every subject achieved the average outcome.

In another example, residents' self-assessed confidence in performing a procedure improved an average of 0.4 point on a Likert-type scale ranging from 1 to 5, after simulation training. While the absolute effect size in the first example appears clear, the effect size in the second example is less apparent. Is a 0.4 change a lot or a little? Accounting for variability in the measured improvement may assist in interpreting the magnitude of the change in the second example.

Thus, effect size can refer to the raw difference between group means, or absolute effect size, as well as standardized measures of effect, which are calculated to transform the effect to an easily understood scale. Absolute effect size is useful when the variables under study have intrinsic meaning (eg, number of hours of sleep). Calculated indices of effect size are useful when the measurements have no intrinsic meaning, such as numbers on a Likert scale; when studies have used different scales so no direct comparison is possible; or when effect size is examined in the context of variability in the population under study.

Calculated effect sizes can also quantitatively compare results from different studies and thus are commonly used in meta-analyses.

Why Report Effect Sizes?

The effect size is the main finding of a quantitative study. While a P value can inform the reader whether an effect exists, the P value will not reveal the size of the effect. In reporting and interpreting studies, both the substantive significance (effect size) and statistical significance (P value) are essential results to be reported.

For this reason, effect sizes should be reported in a paper's Abstract and Results sections. In fact, an estimate of the effect size is often needed before starting the research endeavor, in order to calculate the number of subjects likely to be required to avoid a Type II, or β, error, which is the probability of concluding there is no effect when one actually exists. In other words, you must determine what number of subjects in the study will be sufficient to ensure (to a particular degree of certainty) that the study has acceptable power to support the null hypothesis. That is, if no difference is found between the groups, then this is a true finding.

Why Isn't the P Value Enough?

Statistical significance is the probability that the observed difference between two groups is due to chance. If the P value is larger than the alpha level chosen (eg, .05), any observed difference is assumed to be explained by sampling variability. With a sufficiently large sample, a statistical test will almost always demonstrate a significant difference, unless there is no effect whatsoever, that is, when the effect size is exactly zero; yet very small differences, even if significant, are often meaningless. Thus, reporting only the significant P value for an analysis is not adequate for readers to fully understand the results.

For example, if a sample size is 10 000, a significant P value is likely to be found even when the difference in outcomes between groups is negligible and may not justify an expensive or time-consuming intervention over another. The level of significance by itself does not predict effect size. Unlike significance tests, effect size is independent of sample size. Statistical significance, on the other hand, depends upon both sample size and effect size. For this reason, P values are considered to be confounded because of their dependence on sample size. Sometimes a statistically significant result means only that a huge sample size was used.3
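This dependence on sample size can be sketched numerically. Under a normal approximation (a simplification, not the article's method), the two-sided P value for a standardized mean difference d observed with n subjects per group is roughly 2(1 − Φ(d·√(n/2))). The hypothetical helper below shows a fixed, negligible effect of d = 0.05 becoming "significant" purely because the sample grows:

```python
import math

def approx_p_two_sided(d, n_per_group):
    """Normal-approximation two-sided P value for a standardized
    mean difference d observed with n subjects per group."""
    z = abs(d) * math.sqrt(n_per_group / 2)
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

# A fixed, negligible effect (d = 0.05) becomes "significant" as n grows:
for n in (100, 1_000, 10_000, 100_000):
    print(n, round(approx_p_two_sided(0.05, n), 4))
# n = 100 gives P ≈ .72; n = 10 000 gives P ≈ .0004, with d unchanged.
```

The effect size stays 0.05 throughout; only the P value changes, which is exactly why the P value alone cannot convey the magnitude of a finding.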

A commonly cited example of this problem is the Physicians Health Study of aspirin to prevent myocardial infarction (MI).4 In more than 22 000 subjects over an average of 5 years, aspirin was associated with a reduction in MI (although not in overall cardiovascular mortality) that was highly statistically significant: P < .00001. The study was terminated early due to the conclusive evidence, and aspirin was recommended for general prevention. However, the effect size was very small: a risk difference of 0.77% with r2 = .001, an extremely small effect size. As a result of that study, many people were advised to take aspirin who would not experience benefit yet were also at risk for adverse effects. Further studies found even smaller effects, and the recommendation to use aspirin has since been modified.

How to Calculate Effect Size

Depending upon the type of comparisons under study, effect size is estimated with different indices. The indices fall into two main study categories, those looking at effect sizes between groups and those looking at measures of association between variables (table 1). For two independent groups, effect size can be measured by the standardized difference between two means, or [mean (group 1) − mean (group 2)] / standard deviation.

TABLE 1

Common Effect Size Indices

[Table 1 image not reproduced in this copy.]

The denominator standardizes the difference by transforming the absolute difference into standard deviation units. Cohen's term d is an example of this type of effect size index. Cohen classified effect sizes as small (d  =  0.2), medium (d  =  0.5), and large (d ≥ 0.8).5 According to Cohen, "a medium effect of .5 is visible to the naked eye of a careful observer. A small effect of .2 is noticeably smaller than medium but not so small as to be trivial. A large effect of .8 is the same distance above the medium as small is below it."6 These designations, large, medium, and small, do not take into account other variables such as the accuracy of the assessment instrument and the diversity of the study population. However these ballpark categories provide a general guide that should also be informed by context.
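The pooled-standard-deviation form of this index is easy to compute directly. The sketch below uses made-up examination scores (not data from the article) to illustrate Cohen's d:

```python
import math

def cohens_d(group1, group2):
    """Cohen's d: (mean1 - mean2) / pooled SD, using sample (n-1) variances."""
    n1, n2 = len(group1), len(group2)
    m1 = sum(group1) / n1
    m2 = sum(group2) / n2
    v1 = sum((x - m1) ** 2 for x in group1) / (n1 - 1)
    v2 = sum((x - m2) ** 2 for x in group2) / (n2 - 1)
    pooled_sd = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    return (m1 - m2) / pooled_sd

# Hypothetical exam scores for two intervention groups:
d = cohens_d([31, 34, 33, 36, 35], [30, 32, 31, 33, 29])
print(round(d, 2))  # 1.59, a "large" effect in Cohen's classification
```

The pooled standard deviation in the denominator is what turns a raw score difference into standard deviation units, making results comparable across instruments and scales.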

Between group means, the effect size can also be understood as the average percentile distribution of group 1 vs that of group 2 or the amount of overlap between the distributions of interventions 1 and 2 for the two groups under comparison. For an effect size of 0, the mean of group 2 is at the 50th percentile of group 1, and the distributions overlap completely (100%), that is, there is no difference. For an effect size of 0.8, the mean of group 2 is at the 79th percentile of group 1; thus, someone from group 2 with an average score (ie, mean) would have a higher score than 79% of the people from group 1. The distributions overlap by only 53%, or a non-overlap of 47%, in this situation (table 2).5,6
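Assuming both groups are normally distributed with equal spread, this percentile interpretation follows directly from the standard normal CDF, Φ(d). A quick sketch (an illustration of the relationship, not code from the article):

```python
import math

def percentile_of_group2_mean(d):
    """Percentile of group 1 at which the mean of group 2 sits,
    assuming normal distributions: Phi(d), the standard normal CDF."""
    return 0.5 * (1 + math.erf(d / math.sqrt(2)))

for d in (0.0, 0.2, 0.5, 0.8):
    pct = 100 * percentile_of_group2_mean(d)
    print(f"d = {d}: mean of group 2 at the {pct:.0f}th percentile of group 1")
# d = 0.8 lands at the 79th percentile, matching the text above.
```

Φ(0.8) ≈ 0.79, which is where the "79th percentile" figure for a large effect comes from.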

Table 2

Differences Between Groups, Effect Size Measured by Glass's Δ

[Table 2 image not reproduced in this copy.]

What Is Statistical Power and Why Do I Need It?

Statistical power is the probability that your study will find a statistically significant difference between interventions when an actual difference does exist. If statistical power is high, the likelihood of deciding there is an effect, when one does exist, is high. Power is 1 − β, where β is the probability of wrongly concluding there is no effect when one actually exists. This type of error is termed Type II error. Like statistical significance, statistical power depends upon effect size and sample size. If the effect size of the intervention is large, it is possible to detect such an effect in smaller sample numbers, whereas a smaller effect size would require larger sample sizes. Huge sample sizes may detect differences that are quite small and perhaps trivial.

Methods to increase the power of your study include using more potent interventions that have bigger effects, increasing the size of the sample/subjects, reducing measurement error (use highly valid outcome measures), and raising the α level, but only if making a Type I error is highly unlikely.

How To Calculate Sample Size?

Before starting your study, calculate the power of your study with an estimated effect size; if power is too low, you may need more subjects in the study. How can you estimate an effect size before carrying out the study and finding the differences in outcomes? For the purpose of calculating a reasonable sample size, effect size can be estimated by pilot study results, similar work published by others, or the minimum difference that would be considered important by educators/experts. There are many online sample size/power calculators available, with explanations of their use (BOX).7,8

Box.  Calculation of Sample Size Example

Your pilot study analyzed with a Student t-test reveals that group 1 (N  =  29) has a mean score of 30.1 (SD, 2.8) and that group 2 (N  =  30) has a mean score of 28.5 (SD, 3.5). The calculated P value  =  .06, and on the surface, the difference appears not significantly different. However, the calculated effect size is 0.5, which is considered "medium" according to Cohen. In order to test your hypothesis and determine if this finding is real or due to chance (ie, to find a significant difference), with an effect size of 0.5 and P of <.05, the power will be too low unless you expand the sample size to approximately N  =  60 in each group, in which case, power will reach .80. For smaller effect sizes, to avoid a Type II error, you would need to further increase the sample size. Online resources are available to help with these calculations.
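The box's numbers can be reproduced approximately with the standard normal-approximation power formula, power ≈ Φ(d·√(n/2) − z_{α/2}). The helper below is a sketch under that approximation, not the exact method behind any particular calculator; the exact t-based calculation gives a slightly larger requirement, around 64 per group for d = 0.5 and power .80:

```python
import math

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

Z_CRIT = 1.959964  # two-sided critical value for alpha = .05

def approx_power(d, n_per_group):
    """Normal-approximation power of a two-sample test of effect size d."""
    return phi(abs(d) * math.sqrt(n_per_group / 2) - Z_CRIT)

def n_per_group_for_power(d, target=0.80):
    """Smallest n per group reaching the target power (approximation)."""
    n = 2
    while approx_power(d, n) < target:
        n += 1
    return n

print(n_per_group_for_power(0.5))       # 63 per group under this approximation
print(round(approx_power(0.5, 63), 2))  # roughly .80
```

This is why a "medium" effect of d = 0.5 needs on the order of 60 subjects per group for .80 power, and why smaller effects require sharply larger samples (for d = 0.2 the same formula demands several hundred per group).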

Power must be calculated prior to starting the study; post hoc calculations, sometimes reported when prior calculations are omitted, have limited value due to the incorrect assumption that the sample effect size represents the population effect size.

Of interest, a β error of 0.2 was chosen by Cohen, who postulated that an α error was more serious than a β error. Therefore, he estimated the β error at 4 times the α: 4 × 0.05  =  0.20. Although arbitrary, this convention has been copied by researchers for decades, and use of other levels will need to be explained.

Summary

Effect size helps readers understand the magnitude of differences found, whereas statistical significance examines whether the findings are likely to be due to chance. Both are essential for readers to understand the full impact of your work. Report both in the Abstract and Results sections.

Footnotes

Gail M Sullivan, MD, MPH, is Editor-in-Chief, Journal of Graduate Medical Education; Richard Feinn, PhD, is Assistant Professor, Department of Psychiatry, University of Connecticut Health Center.

References

1. Kline RB. Beyond Significance Testing: Reforming Data Analysis Methods in Behavioral Research. Washington, DC: American Psychological Association; 2004. p. 95.

2. Cohen J. Things I have learned (so far). Am Psychol. 1990;45:1304–1312.

4. Bartolucci AA, Tendera K, Howard G. Meta-analysis of multiple primary prevention trials of cardiovascular events using aspirin. Am J Cardiol. 2011;107(12):1796–1801.

6. Coe R. It's the effect size, stupid: what "effect size" is and why it is important. Paper presented at the 2002 Annual Conference of the British Educational Research Association, University of Exeter, Exeter, Devon, England, September 12–14, 2002. http://www.leeds.ac.uk/educol/documents/00002182.htm. Accessed March 23, 2012.


Articles from Journal of Graduate Medical Education are provided here courtesy of Accreditation Council for Graduate Medical Education



Source: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3444174/
