Quantifying Effect Sizes in Randomised and Controlled Trials: A Review
Meta-analysis aggregates quantitative outcomes from multiple scientific studies to produce comparable effect sizes. The resulting integration of information yields a statistical estimate with higher power and a more reliable point estimate than the measure derived from any individual study. Effect sizes are usually estimated using mean differences between the outcomes of treatment and control groups in experimental studies. Although different software packages exist for the calculations in meta-analysis, understanding how the calculations are done can be useful to many researchers, particularly where the values reported in the literature cannot be used directly in the software available to the researcher. In this paper, a search was conducted online, primarily using Google and PubMed, to retrieve relevant articles. The different methods of calculating effect sizes and the associated confidence intervals, effect size correlation, p values and I², and how to evaluate heterogeneity and publication bias, are presented.
Size of Effects
“Meta-analysis is a statistical analysis” developed by Glass1 in 1976 to perform a “relatively powerful evaluation of a specific hypothesis and to draw quantitative inferences. It integrates the quantitative findings from multiple scientific, but similar studies, and provides a numerical estimate of the overall effect of interest”1–3. In randomised and controlled trials, the effect of interest can be “(i) an average of a continuous variable, (ii) a correlation between two variables, (iii) an odds ratio (suitable for analyzing retrospective studies), (iv) a relative risk (risk ratio) or risk difference (suitable for analyzing prospective studies), or (v) a proportion”. Randomised studies are often considered to reduce bias, while studies with controls are selected because it is the effect sizes of the control and treatment groups that are compared. While one meta-analysis may combine many studies to determine the effect size of a particular outcome (e.g., cure of malaria), another may compare different effects (e.g., cure of malaria, incidence of recrudescence, antimalarial resistance, side effects, etc.) from the same set of studies included in the study design.
The basic principle behind meta-analysis is that a common fact lies behind all conceptually similar scientific studies, but that it is measured with a certain error within the individual studies. Approaches from statistics are then “applied to derive a pooled estimate nearest to the unknown common fact based on how the error is perceived”4–6. Different weights are usually assigned to the different studies when calculating the pooled effect. This weighting is related to the inverse of the variance and hence indirectly to the sample size reported in the studies: studies with a smaller standard deviation and a larger sample size are given more weight in the calculation of the pooled effect size. “The agreement or disagreement between the studies is examined using different measures of heterogeneity which refers to the variation in study outcomes between studies”4–6. Other than providing an estimate of the unknown common truth, “meta-analyses have the ability to contrast results from different studies and identify patterns among study results, sources of disagreement among those results, or other interesting relationships that may come to light in the context of multiple studies”6–8.
The major benefit of this approach is the aggregation of useful information, which leads to higher statistical power and a more reliable point estimate than the measure derived from any individual study7. Even a low-powered meta-analysis utilizing a small number of studies can still provide useful information. Thus, researchers are often motivated to include meta-analysis in systematic reviews for reasons which include (1) to increase power, (2) to improve precision, (3) to answer questions not posed by the individual studies, and (4) to settle controversies or generate new hypotheses6.
Models in Meta-Analysis
Two models are commonly used in meta-analysis, namely the fixed-effect (common-effect) and random-effect models. Under the fixed-effect model, it is assumed that all studies included (i) investigate the same population, (ii) apply the same variable and outcome definitions, and (iii) share one true effect size that underlies all the studies in the analysis, so that (iv) all differences in observed effects are due to sampling error. The inverse of the variance of each study estimate (wi = 1/σ², where wi is the weight of an individual study and σ² is the variance within the study) “is commonly used as study weight, such that larger studies tend to contribute more than smaller studies to the weighted average”. Thus, when the studies included in a meta-analysis are “dominated by a very large study, the findings from smaller studies are practically ignored”. However, the fixed-effect assumption is considered unrealistic, since research is often subject to different sources of heterogeneity, “a measure of the level of inconsistency in different studies”. In the random-effect model, the assumption is that “the true effect size might differ from one study to another. For example, the effect size might be higher (or lower) in studies where the participants are older, or more educated, or healthier than in other studies, or when a more intensive variant of an intervention is used”. The random-effect estimate is simply the weighted average of the effect sizes of a group of studies [wi = 1/(σ² + τ²), where τ² is the variance between the studies]. The term “random” “reflects the fact that the studies included in the analysis are assumed to be a random sample of all possible studies that meet the inclusion criteria for the review”8. This implies that the greater the heterogeneity, the more nearly equal the study weights become, up to the point at which the random-effect estimate is simply the unweighted average effect size of the studies8.
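The two weighting schemes can be illustrated with a minimal Python sketch. The function name `pooled_effect`, the three study values, and the between-study variance passed as `tau2` are all hypothetical; in practice τ² has to be estimated from the data, which this sketch does not attempt.

```python
import math

def pooled_effect(effects, variances, tau2=0.0):
    """Inverse-variance weighted mean of study effect sizes.

    tau2 = 0 gives the fixed-effect estimate; tau2 > 0 (an assumed
    between-study variance) gives the random-effect estimate.
    """
    weights = [1.0 / (v + tau2) for v in variances]
    total = sum(weights)
    mean = sum(w * d for w, d in zip(weights, effects)) / total
    se = math.sqrt(1.0 / total)  # standard error of the pooled estimate
    shares = [w / total for w in weights]  # relative weight of each study
    return mean, se, shares

# Hypothetical data: one large, precise study and two small ones.
effects = [0.10, 0.60, 0.70]
variances = [0.01, 0.09, 0.16]

fixed_mean, fixed_se, fixed_share = pooled_effect(effects, variances)
random_mean, random_se, random_share = pooled_effect(effects, variances, tau2=0.10)

print(f"fixed-effect:  d = {fixed_mean:.3f}, weight of study 1 = {fixed_share[0]:.2f}")
print(f"random-effect: d = {random_mean:.3f}, weight of study 1 = {random_share[0]:.2f}")
```

With these illustrative numbers the large study dominates the fixed-effect estimate, while the assumed τ² spreads the weights more evenly and moves the pooled value toward the unweighted mean.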
How to Carry Out Meta-Analysis
The Cochrane handbook6 and the PRISMA statement8, as well as the report from Howard et al.9, provide appropriate information on how to carry out a systematic review and meta-analysis. In general, the steps provided in Table 1 are useful.
When studies with poor methods are included in the data set for meta-analysis, the ability of the meta-analyst to compute a strong mean effect size or identify important controlling variables may be compromised. It is therefore crucial to define and report the criteria by which studies are assessed for inclusion. In addition, it is important to consider thoroughly whether the studies being considered for inclusion can reasonably be combined11. In other words, the following questions may be asked: (1) Do the studies consider a common outcome? (2) Were the outcomes measured in a similar way? (3) Were the effect sizes determined using the same outcome measure? These issues may be less important in some fields such as ecology than in others such as pharmacy and medicine.
The independence of effect sizes is a very important statistical assumption in meta-analysis12. “Statistical independence implies that each effect size (or sample) represents an independent entity and the pooled effect size does not have a correlated structure. Non-independence is a major consideration in data set as it can affect (i) the calculation of effect size statistics and (ii) the estimations of overall meta-analytic estimates with their uncertainty which are two major interrelated components of a meta-analysis”13. Potential sources of non-independence include the extraction of multiple effect sizes from a single experiment or from different time points throughout a study, an effect being measured on each individual or simulated in a study, research on multiple species, and the influence of the research group. Non-independence can increase type I error rates in meta-analysis. In statistical hypothesis testing, a type I error occurs when a true null hypothesis is incorrectly rejected (false positive), while a type II error is the acceptance of a false null hypothesis (false negative).
| Step | Activity |
|---|---|
| “Step 1 | A thorough literature search for studies that address the hypothesis of interest, using defined keywords and search methods is performed. This will usually include searching for unpublished studies, for example by posting requests to professional manufacturers, newsletters or mailing lists. The research question can be formulated in terms of the problem/population, intervention, comparison, and outcome (PICO)10. |
| Step 2 | The resulting studies are critically appraised and evaluated for possible inclusion in the review. Possible questions to be addressed for each article include: Is the publication applicable? Are the study methods appropriate? Is there enough information to calculate an effect size? (Record the reasons for dropping any studies from your data set). |
| Step 3 | An appropriate measure of effect size is selected, and the effect size is calculated for each study retained. |
| Step 4 | The selected studies are entered into a master database; information to be recorded should include study identity (author, and year), effect size(s), sample size(s) and information which codes each study for variables which may affect the outcome of each study, or whose possible influence on effect size needs to be investigated (experimental design, taxonomic information on the study species, geographic location of study population, life-history variables of the species used etc). How the effect size(s) is/are calculated for each study is also recorded. |
| Step 5 | A summary of the cross-study support for the hypothesis of interest is done using meta-analytical methods. Also, any variation in conclusions drawn by individual studies is explained. |
| Step 6 | The robustness and power of the analysis (likelihood of type I and type II errors) are determined”8,9 |
A search was conducted online, primarily using Google and the PubMed database, to retrieve relevant articles on the different methods used to calculate effect sizes and the associated confidence intervals, effect size correlation, p values and I², as well as how to evaluate heterogeneity and publication bias, based on the records available as of May 2017. The search terms used included ‘effect size calculation’, ‘effect size and clinical trials’, ‘calculation of effect size in clinical trials’, ‘randomised clinical trials and effect size’, ‘effect size correlation’, ‘heterogeneity and effect size’, ‘publication bias’, ‘publication bias in clinical trials’, ‘fixed effect and effect size’, and ‘random effect and effect size’. Retrieved articles that were not in English or that had no relevant information on effect size, heterogeneity, or publication bias were excluded, while the rest were evaluated to extract relevant information and, as appropriate, used to identify other relevant articles.
Data were independently extracted by the corresponding author and verified by another author.
Results and Discussion
Calculation of Effect Size in Meta-Analysis
Several ways have been used to calculate effect size, but the three most popular approaches are those of Gene Glass, Hunter-Schmidt, and Cohen’s d14. While these methods will not necessarily yield the same d values for a given set of data, applying one method consistently across all the studies included in a meta-analysis allows the effect sizes of the individual studies to be compared.
In the fixed-effect model, when a study reports the mean and standard deviation (variance) of the treatment and control groups, Cohen’s d can be used to calculate the standardised difference between the two means as follows:
d = (µ1 − µ2) / σpooled (1)
σpooled = √[(σ1² + σ2²) / 2] (2)
where d is the effect size, µ1 and µ2 represent the means of the effects of the treatment and control groups, σ1 and σ2 are the standard deviations of the effects in the treatment and control groups, and σpooled is the pooled standard deviation. However, in the random-effect model, the introduction of the between-studies variance (τ²) will change σ to VR where
where τ² = σ1² + σ2² based on the ‘variance sum law’ for independent variables.
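As an illustration of Cohen’s d, the following Python sketch computes the standardised mean difference from reported group means and standard deviations. It assumes the pooled standard deviation is the root mean square of the two group standard deviations, which treats the groups as equally sized; the function name and the example values are hypothetical.

```python
import math

def cohens_d(mean_treatment, mean_control, sd_treatment, sd_control):
    """Standardised mean difference between treatment and control.

    Assumption: the pooled SD is the root mean square of the two group SDs
    (appropriate when the groups are of similar size).
    """
    pooled_sd = math.sqrt((sd_treatment ** 2 + sd_control ** 2) / 2.0)
    return (mean_treatment - mean_control) / pooled_sd

# Hypothetical study: treatment mean 24 (SD 5), control mean 20 (SD 3).
d = cohens_d(24.0, 20.0, 5.0, 3.0)
print(f"d = {d:.3f}")
```

Where group sizes differ markedly, a pooled standard deviation weighted by the group degrees of freedom may be preferable to the equal-weight form assumed here.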
When a study reports a percentage of success after taking the treatment or no treatment (hit rate), the following formula15 can be used:
d = arcsine(p1) − arcsine(p2) (4)
where p1 and p2 are the hit rates of the control and treatment groups, ordered according to the direction of the desired effect. The arcsine is the inverse of the sine function, and the returned angle is given in radians in the range −π/2 to π/2. In Microsoft Excel, this value is calculated as
arcsine(p) = ASIN(p) (5)
where p is the proportion; ASIN accepts arguments from −1 to 1, and a proportion always lies between 0 and 1.
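Equation (4) can be evaluated outside Excel as well. The Python sketch below applies the arcsine directly to each hit rate, exactly as the equation is written; the function name and the hit rates are hypothetical. Note that a widely used alternative, Cohen’s h, applies the transform to the square root of each proportion and doubles it, so values from the two conventions are not interchangeable.

```python
import math

def arcsine_difference(p_first, p_second):
    """Difference of arcsine-transformed hit rates, following Equation (4).

    The arguments should be ordered so that a positive result lies in the
    direction of the desired effect.
    """
    for p in (p_first, p_second):
        if not 0.0 <= p <= 1.0:
            raise ValueError("hit rates must be proportions between 0 and 1")
    return math.asin(p_first) - math.asin(p_second)  # angles in radians

# Hypothetical hit rates: 70% success with treatment, 50% without.
d = arcsine_difference(0.70, 0.50)
print(f"d = {d:.4f}")
```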
Using the t value from a between-subjects t test and its degrees of freedom (df), d can be calculated as
d = 2t / √df (6)
When the studies list F statistics, d can be calculated as follows:
where MSE is the mean square error, n is the number of subjects in the treatment (t) or control (c) group, and F is the reported F statistic, usually given, for example, with the notation F (dfc,dfs) = fx, where dfc is the degrees of freedom based on the number of conditions, dfs is the degrees of freedom based on the number of subjects and fx is the F value [e.g., F (1,39) = 3.12]
Effect-size correlation (r) is obtained from
r = d / √(d² + 4)
For t statistics,
r = √[t² / (t² + df)]
r² is the proportion of the variance in one variable that is ‘accounted for’ by the other; that is, the proportional reduction in the variance of the outcome measure when it is replaced by the variance of the residuals obtained from a regression equation. When this is extended to multiple regression, it characterizes the proportion of the variance accounted for by all the independent variables; in ANOVA the corresponding quantity is often called ‘eta-squared’, η². Thus, r² is often advocated as a universal measure of effect size.
It is important to note that the means in the above equations are arranged in the direction of the effects. For example, if the desired effect is an increase in the outcome, the control mean will be subtracted from the treatment mean. Thus, d and r are positive if the mean difference is in the predicted direction.
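The conversions between t, d and r can be sketched in Python as follows. The formulas used are the commonly quoted ones for two groups of equal size (d = 2t/√df, r = d/√(d² + 4), r = t/√(t² + df)); they are stated here as assumptions, and the function names and example values are hypothetical.

```python
import math

def d_from_t(t, df):
    """Cohen's d from a between-subjects t value (assumes equal group sizes)."""
    return 2.0 * t / math.sqrt(df)

def r_from_d(d):
    """Effect-size correlation from d (assumes equal group sizes)."""
    return d / math.sqrt(d * d + 4.0)

def r_from_t(t, df):
    """Effect-size correlation from t; the sign of t is preserved."""
    return t / math.sqrt(t * t + df)

# Hypothetical result: t(38) = 2.5 in the predicted direction.
d = d_from_t(2.5, 38)
r = r_from_d(d)
print(f"d = {d:.3f}, r = {r:.3f}, r^2 = {r * r:.3f}")
```

Going from t to d and then to r gives the same value as converting t to r directly, which is a useful consistency check when a paper reports only a test statistic.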
If the effect size estimate from the sample is d, then it is approximately normally distributed, with the following standard deviation:
σ = √[(nt + nc) / (nt × nc) + d² / (2(nt + nc))] (12)
where nt is the number in the experimental group and nc is that of the control group. Hence a 95% confidence interval for d would be from
d – 1.96 × σ to d + 1.96 × σ (13)
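A short Python sketch of this interval is given below. It assumes the usual large-sample approximation for the standard deviation of d, namely √[(nt + nc)/(nt·nc) + d²/(2(nt + nc))]; the function name and the example values are hypothetical.

```python
import math

def d_confidence_interval(d, n_treatment, n_control, z=1.96):
    """Approximate confidence interval for a standardised mean difference.

    z = 1.96 corresponds to a 95% interval, as in Equation (13).
    """
    n_total = n_treatment + n_control
    sd = math.sqrt(n_total / (n_treatment * n_control) + d * d / (2.0 * n_total))
    return d - z * sd, d + z * sd

# Hypothetical study: d = 0.5 with 20 participants per group.
lower, upper = d_confidence_interval(0.5, 20, 20)
print(f"95% CI for d: {lower:.3f} to {upper:.3f}")
```

In this illustrative case the interval includes zero, so a study of this size would not on its own establish an effect of d = 0.5.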
Odds Ratio (OR) and Relative Risks (RR)
OR and RR are other possible indices of effect in group designs. An example of a meta-analysis report5 in which OR was used in the estimation is shown in Figure 1. Using both fixed- and random-effect models, an example of a “forest plot” from a meta-analysis of different studies is illustrated in Figure 2.
Odds ratio reflects the odds of a successful or desired outcome in the intervention group relative to the odds of a similar outcome in the control group. Consider the following 2x2 frequency table (Table 2).
Absolute risk reduction (ARR) = ARC – ART (16)
where ARC is the AR of events in the control group and ART is the AR of events in the treatment group.
The standard deviation (σ) of OR or RR can be calculated from
where H is the calculated value of OR or RR and a, b, x and y are as defined in equation 14.
The 95% confidence interval can be calculated from
where M is the OR or RR.
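For readers who want to check reported values, the Python sketch below computes OR, RR and ARR from the four cells of a 2 × 2 table. The cell names are hypothetical and are not the labels used in Table 2, and the confidence interval shown is the standard log-scale (Woolf) interval for the odds ratio, which may differ in form from the expressions referred to above.

```python
import math

def risk_measures(events_t, no_events_t, events_c, no_events_c, z=1.96):
    """OR, RR and ARR from a 2 x 2 table (all four cells must be non-zero)."""
    risk_t = events_t / (events_t + no_events_t)  # absolute risk, treatment (ART)
    risk_c = events_c / (events_c + no_events_c)  # absolute risk, control (ARC)
    odds_ratio = (events_t / no_events_t) / (events_c / no_events_c)
    risk_ratio = risk_t / risk_c
    arr = risk_c - risk_t  # Equation (16): ARR = ARC - ART
    # Standard error of ln(OR), the usual Woolf approximation.
    se_log_or = math.sqrt(1 / events_t + 1 / no_events_t
                          + 1 / events_c + 1 / no_events_c)
    log_or = math.log(odds_ratio)
    or_ci = (math.exp(log_or - z * se_log_or), math.exp(log_or + z * se_log_or))
    return {"odds_ratio": odds_ratio, "risk_ratio": risk_ratio,
            "arr": arr, "or_ci": or_ci}

# Hypothetical trial: 15/100 events with treatment, 30/100 with control.
result = risk_measures(15, 85, 30, 70)
print(result)
```

Tables containing a zero cell need a continuity correction or a different method, which this sketch does not handle.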
Heterogeneity and Publication Bias
Irrespective of the assumed quality of meta-analysis in research, the reliability and strength of any inferences derived from it rely on the population of individual studies included. Thus, in reporting a meta-analysis, issues relating to which studies are included are vital. At the same time, it is essential to have some means of evaluating the likelihood that an apparently significant non-zero mean effect size is the outcome of a type I error, and the likelihood that a zero mean effect size is the outcome of a lack of statistical power rather than a realistic reflection of the mean effect size of the population. Both type I and type II error rates are affected by the number and identity of the included studies and their individual sample sizes11.
In meta-analysis, heterogeneity is the variability in outcomes across different studies. It is a consequence of clinical or methodological differences (or both) among the studies6. Statistical tests of heterogeneity are very popular in meta-analysis reports despite their well-known limitations. Cochran’s Q is a classical measure of heterogeneity in different studies16.
Q = Σ wi (di − dm)², summed over i = 1 to k, with wi = 1/(σ² + τ²)
where wi is the study weight, di is the individual study effect size, dm is the mean effect size for the studies, k is the number of studies, σ² is the variance within studies (σ[di]) and τ² (tau squared) is the variance between the studies (σ[dm]). As indicated in the equation, Q is the weighted sum of squared differences between each study estimate and the pooled estimate, with the weights being those used in the pooling method. “It is distributed as a chi-square statistic with k -1 degrees of freedom (df) where k is the number of studies”16. For the fixed-effect model, τ = 0 (i.e., wi = 1/σ²), as it is assumed that there is no variability between the studies, unlike in the random-effect model. One commonly used statistic for quantifying inconsistency among studies is I².
where nt and nc are the total numbers of individuals in all the studies in the treatment and control groups, respectively.
I² = [(Q − df) / Q] × 100%
where Q is the chi-squared statistic and df is its degrees of freedom4,5. This equation describes the percentage of the variability in effect estimates that is due to heterogeneity rather than sampling error (chance). If the I² estimate from the studies is y, the standard deviation for the distribution is given by
where nt is the total number in the experimental group and nc is the total number in the control group. Hence a 95% confidence interval for I² would be from
The thresholds for the interpretation of I² can be misleading, as the importance of inconsistency depends on many factors. It should be noted that a low value of I² may reflect only trivial heterogeneity but does not rule out substantial heterogeneity. As a rough guide, 0% to 40% might not be important, 30% to 60% may represent moderate heterogeneity, 50% to 90% may represent substantial heterogeneity and 75% to 100% considerable heterogeneity. How a value within these overlapping ranges is read depends on the magnitude and direction of effects and on the strength of evidence for heterogeneity, such as the P value from the chi-squared test or a confidence interval for I². More simply, I² values of 25%, 50%, and 75% can be taken to correspond to small, moderate, and large heterogeneity.
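The calculation of Q and I² can be sketched in Python as below, taking Q as the weighted sum of squared deviations from the pooled estimate and I² as 100% × (Q − df)/Q. The function names and the study values are hypothetical, and negative I² values are set to zero, which is the usual convention rather than something stated in the text.

```python
def cochran_q(effects, variances, tau2=0.0):
    """Cochran's Q and its degrees of freedom (k - 1).

    tau2 = 0 uses the fixed-effect weights 1/variance.
    """
    weights = [1.0 / (v + tau2) for v in variances]
    pooled = sum(w * d for w, d in zip(weights, effects)) / sum(weights)
    q = sum(w * (d - pooled) ** 2 for w, d in zip(weights, effects))
    return q, len(effects) - 1

def i_squared(q, df):
    """I-squared as a percentage, truncated at zero when Q <= df."""
    if q <= 0:
        return 0.0
    return max(0.0, 100.0 * (q - df) / q)

# Hypothetical effect sizes and within-study variances for three studies.
effects = [0.10, 0.60, 0.70]
variances = [0.01, 0.09, 0.16]

q, df = cochran_q(effects, variances)
i2 = i_squared(q, df)
print(f"Q = {q:.3f}, df = {df}, I2 = {i2:.1f}%")
```

For these illustrative values I² is a little over 50%, which would fall in the moderate-to-substantial range described above.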
In Microsoft Excel, the function to compute a p-value for Q is
p = CHIDIST(Q, df)
Thus, if Q = 13.4626 and df = 1, p = CHIDIST(13.4626, 1) = 0.0002. Usually, if p < 0.05, the difference is assumed to be ‘significant’.
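The same upper-tail probability can be obtained without Excel. The Python sketch below uses only the standard library and the closed-form series for integer degrees of freedom; the function name `chi2_sf` is hypothetical, and a statistics package would normally be used instead of hand-written code.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable.

    Uses the finite series valid for integer df >= 1.
    """
    if df < 1 or int(df) != df:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    df = int(df)
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{r=0}^{df/2 - 1} (x/2)^r / r!
        term, series = 1.0, 1.0
        for r in range(1, (df - 2) // 2 + 1):
            term *= x / (2 * r)
            series += term
        return math.exp(-x / 2.0) * series
    # Odd df: two-sided normal tail plus a finite correction series.
    chi = math.sqrt(x)
    tail = math.erfc(chi / math.sqrt(2.0))
    term, series = chi, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        series += term
    density = math.exp(-x / 2.0) / math.sqrt(2.0 * math.pi)
    return tail + 2.0 * density * series

p = chi2_sf(13.4626, 1)
print(f"p = {p:.4f}")
```

Called with the values quoted above, the function agrees with the CHIDIST result to four decimal places.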
Publication bias often represents the highest potential source of type I error (i.e., false positive) in meta-analysis. Over recent years, different nomenclatures have been developed for biases related to publication bias. These include the selective exclusion of patients from the analysis17, outcome reporting bias18, time lag bias19, and location bias20,21. A funnel plot (Figure 3) “is a graphical tool commonly used for detecting bias in meta-analysis and systematic reviews. In this plot, treatment effect is plotted on the horizontal axis and the standard error on the vertical axis and the vertical line represents the summary estimated derived using fixed-effect meta-analysis. Two diagonal lines represent (pseudo) 95% confidence limits (effect±1.96 SE) around the summary effect for each standard error on the vertical axis. These show the expected distribution of studies in the absence of heterogeneity or of selection bias. In the absence of heterogeneity, 95% of the studies should lie within the funnel defined by these diagonal lines”22,23.
Publication bias results in asymmetry of the funnel plot; smaller studies usually show the larger effects. However, a funnel plot may not always be a reliable tool, in particular when the number of studies included in the analysis is small.
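The pseudo 95% limits of a funnel plot are simple to compute. The Python sketch below returns the limits (summary effect ± 1.96 SE) at a given standard error and lists the studies that fall outside them; the function names, the study values and the summary effect of 0.18 are hypothetical, and no plotting is attempted.

```python
def funnel_limits(summary_effect, standard_error, z=1.96):
    """Pseudo 95% confidence limits around the summary effect at one SE."""
    half_width = z * standard_error
    return summary_effect - half_width, summary_effect + half_width

def outside_funnel(effects, standard_errors, summary_effect, z=1.96):
    """Indices of studies whose effect lies outside the funnel limits."""
    flagged = []
    for index, (d, se) in enumerate(zip(effects, standard_errors)):
        lower, upper = funnel_limits(summary_effect, se, z)
        if d < lower or d > upper:
            flagged.append(index)
    return flagged

# Hypothetical studies: effect sizes with their standard errors.
effects = [0.10, 0.60, 0.70, 0.15]
standard_errors = [0.10, 0.30, 0.20, 0.12]

flagged = outside_funnel(effects, standard_errors, summary_effect=0.18)
print("studies outside the funnel:", flagged)
```

A count of studies outside the limits is only a rough screen; asymmetry is judged from the overall shape of the plot, and with few studies neither approach is dependable.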
In meta-analysis, a type II error occurs when a true effect is unrecognised. This is often associated with meta-analyses of a small number of studies and is of great concern when compared with type I errors. Statistical power is the probability that the meta-analysis detects the expected effect, if the effect actually exists. If a mean effect size is approximately zero, if no significant heterogeneity exists among the studies, or if it is not concluded that a variable moderated the effect size, it becomes important to exclude lack of statistical power as the explanation. The power varies from one study to another, depending on the specific mean effect size difference (d) and the corresponding standard error (σ) of the study. Power is high in studies where d is large and σ is small; that is, studies are likely to identify effects when the effects are large and/or when the studies contain a large amount of information.
To derive the power of the individual studies that contribute to the meta-analysis, the within-study standard errors are not estimated prior to performing the meta-analysis. Instead, the normal within-study approximation, Yi ∼ N(µi, σi²), is used (µi denotes the true effect in study i, Yi is the study’s estimate of µi and σi is the corresponding standard error). It is also assumed that two-tailed hypothesis tests are applied. The test statistic for H0 : µi = µ0 versus H1 : µi ≠ µ0 in the ith study is given by
Zi = (Yi − µ0) / σi
For no effect, µ0 = 0
H0 : µi = µ0, Zi ∼ N(0, 1) …. null hypothesis
Zi ∼ N(δi/σi, 1) where δi = µi − µ0 …. alternate hypothesis
Using a two-tailed test, the null hypothesis is rejected by the ith study if |Zi| ≥ Za, and accepted if |Zi| < Za, where Za is the critical value from a standard normal distribution; Za = 1.96 at the conventional 5% significance level assumed to have been used in the analysis. The probability (p) of accepting the null hypothesis is therefore given by
p = ϕ(Za − δi/σi) − ϕ(−Za − δi/σi) (31)
where ϕ is the standard normal cumulative distribution function. In Microsoft Excel 2010 and later,
ϕ = NORM.S.DIST(z, TRUE) (32)
And in earlier versions of MS Excel,
ϕ = NORMSDIST(z) (33)
where z is the value at which the function is to be evaluated; for a 95% confidence interval, z = 1.96 and ϕ = 0.975.
Thus, the power for a two-tailed test for a fixed-effect model is calculated as the probability of correctly rejecting a false null hypothesis, given by
Power = 1 − ϕ(Za − δi/σi) + ϕ(−Za − δi/σi) (34)
For a random-effect model, σi will be replaced with VR (Equation (3)), which gives
Power = 1 − ϕ(Za − δi/VR) + ϕ(−Za − δi/VR) (35)
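The power calculation can be sketched in Python with the standard library’s normal distribution. The sketch takes the power of a two-tailed test as 1 − ϕ(Za − δ/σ) + ϕ(−Za − δ/σ), which follows from Zi ∼ N(δi/σi, 1) under the alternative hypothesis; the function name and the example values are hypothetical. For a random-effect calculation the standard error passed in would be the inflated value; using √(σ² + τ²) for that purpose is an assumption, not a form confirmed by the text.

```python
from statistics import NormalDist

def study_power(delta, standard_error, alpha=0.05):
    """Power of a two-tailed z test to detect a true difference delta."""
    normal = NormalDist()  # standard normal: mean 0, SD 1
    z_crit = normal.inv_cdf(1.0 - alpha / 2.0)  # about 1.96 for alpha = 0.05
    shift = delta / standard_error
    return 1.0 - normal.cdf(z_crit - shift) + normal.cdf(-z_crit - shift)

# Hypothetical study: true effect 0.28 with standard error 0.10.
power_fixed = study_power(0.28, 0.10)

# Same study with an assumed between-study variance tau^2 = 0.01.
power_random = study_power(0.28, (0.10 ** 2 + 0.01) ** 0.5)

print(f"fixed-effect power:  {power_fixed:.3f}")
print(f"random-effect power: {power_random:.3f}")
```

When delta is zero the function returns the significance level itself, which is a convenient check that the two tails have been combined correctly.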
There is no doubt that meta-analysis provides very useful information for decision-making in the practice of pharmacy. Results from the integration of a small number of studies should be accepted with some caution, even if the p value indicates extreme statistical significance. So long as studies are well conducted, those involving several hundred events are more likely to be reliable and clinically useful. Overall, the use of individual patient data in meta-analysis may provide the best evidence of treatment effects in cohort studies and in clinically important subgroups.
In reporting, it is important to provide the following: d, r², the mean of d and its 95% confidence interval (CI), Q, df, the p value for Q, I² and its 95% CI, as well as the power of the study.
The authors are grateful to Biotech Origin, University of Benin, Benin City for their financial support.
Conflict of Interest
No conflict of interest associated with this article.