Quantifying Effect Sizes in Randomised and Controlled Trials: A Review

Department of Clinical Pharmacy and Pharmacy Practice, Faculty of Pharmacy, University of Benin, Benin City, Edo 30001, Nigeria; patrick.erah@uniben.edu
Department of Clinical Pharmacy and Pharmacy Practice, Faculty of Pharmacy, University of Ilorin, Ilorin, Kwara, Nigeria; sibello10@yahoo.com
Department of Clinical Pharmacy and Pharmacy Practice, Faculty of Pharmacy, Niger Delta University, Wilberforce Island, Bayelsa, Nigeria; pharmkenny@gmail.com


Introduction
"Meta-analysis is a statistical analysis" developed by Glass 1 in 1976 to perform a "relatively powerful evaluation of a specific hypothesis and to draw quantitative inferences. It integrates the quantitative findings from multiple scientific, but similar studies, and provides a numerical estimate of the overall effect of interest" [1][2][3] . In randomised and controlled trials, the effect of interest can be "(i) an average of a continuous variable, (ii) a correlation between two variables, (iii) an odds ratio (suitable for analyzing retrospective studies), (iv) a relative risk (risk ratio) or risk difference (suitable for analyzing prospective studies), or (v) a proportion". Randomised studies are generally considered to reduce bias, and controlled studies are selected because it is the effect sizes of the control and treatment groups that are compared. While one meta-analysis may combine many studies to determine the effect size of a particular outcome (e.g., cure of malaria), another may compare different effects (e.g., cure of malaria, incidence of recrudescence, antimalarial resistance, side effects, etc.) from the same set of included studies.
The basic principle behind meta-analysis is that all conceptually similar scientific studies share a common truth, which each individual study measures with a certain error. Approaches from statistics are then "applied to derive a pooled estimate nearest to the unknown common fact based on how the error is perceived" [4][5][6] . Different weights are usually assigned to the different studies when calculating the pooled effect. This weighting is related to the inverse of the variance and hence indirectly to the sample size reported in the studies. Studies with a smaller standard deviation and a larger sample size are given more weight in the calculation of the pooled effect size. "The agreement or disagreement between the studies is examined using different measures of heterogeneity which refers to the variation in study outcomes between studies" [4][5][6] . Besides providing an estimate of the unknown common truth, "meta-analyses have the ability to contrast results from different studies and identify patterns among study results, sources of disagreement among those results, or other interesting relationships that may come to light in the context of multiple studies" [6][7][8] .
The major benefit of this approach is the aggregation of useful information, which leads to higher statistical power and a more reliable point estimate than the measure derived from any individual study 7 . Even a low-powered meta-analysis utilizing a small number of studies can still provide useful information. Thus, researchers are often motivated to include meta-analysis in systematic reviews in order (1) to increase power, (2) to improve precision, (3) to answer questions not posed by the individual studies, and (4) to settle controversies or generate new hypotheses 6 .

Models in Meta-Analysis
Two models are commonly used in meta-analysis, namely the fixed-effect (common effect) and random-effect models. Under the fixed-effect model, it is assumed that all studies included (i) investigate the same population, (ii) apply the same variable and outcome definitions, (iii) have one true effect size that underlies all the studies in the analysis, and (iv) all differences in observed effects are due to sampling error. The inverse of the within-study variance ($w_i = 1/\sigma^2$, where $w_i$ is the weight of an individual study and $\sigma^2$ is the variance within the study) "is commonly used as study weight, such that larger studies tend to contribute more than smaller studies to the weighted average". Thus, when the studies included in a meta-analysis are "dominated by a very large study, the findings from smaller studies are practically ignored". However, the fixed-effect assumption is considered unrealistic, since research is often subject to different sources of heterogeneity, "a measure of the level of inconsistency in different studies".
In the random-effect model, the assumption is that "the true effect size might differ from one study to another. For example, the effect size might be higher (or lower) in studies where the participants are older, or more educated, or healthier than in other studies, or when a more intensive variant of an intervention is used". The random-effect estimate is again a weighted average of the effect sizes of a group of studies, but with weights $w_i = 1/(\sigma^2 + \tau^2)$, where $\tau^2$ is the variance between the studies. The term "random" "reflects the fact that the studies included in the analysis are assumed to be a random sample of all possible studies that meet the inclusion criteria for the review" 8 . This implies that the greater the heterogeneity, the more nearly equal the study weights become, up to the point where the random-effect estimate is simply the unweighted average effect size of the studies 8 .
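As a minimal sketch of the two weighting schemes, the Python fragment below pools three study effect sizes under both models. The effect sizes, the variances and the value of tau2 are made-up illustrative numbers, and the function name pooled_effect is hypothetical. In practice the between-study variance is estimated from the data, which is not shown here.

```python
def pooled_effect(effects, variances, tau2=0.0):
    """Inverse-variance weighted mean of study effect sizes.

    tau2 = 0 gives fixed-effect weights 1/variance; tau2 > 0 gives
    random-effect weights 1/(variance + tau2).
    Returns the pooled mean and each study's share of the total weight.
    """
    weights = [1.0 / (v + tau2) for v in variances]
    total = sum(weights)
    mean = sum(w * d for w, d in zip(weights, effects)) / total
    shares = [w / total for w in weights]
    return mean, shares


# Illustrative (invented) data: one precise study and two imprecise ones.
effects = [0.10, 0.50, 0.90]
variances = [0.01, 0.04, 0.09]

fixed_mean, fixed_shares = pooled_effect(effects, variances)
random_mean, random_shares = pooled_effect(effects, variances, tau2=1.0)

print("fixed-effect mean:", round(fixed_mean, 4), "weight shares:", fixed_shares)
print("random-effect mean:", round(random_mean, 4), "weight shares:", random_shares)
```

With a large tau2 the weight shares become nearly equal, so the pooled value moves towards the unweighted mean of the three effect sizes, which is the behaviour described for the random-effect model.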

How to Carry Out Meta-Analysis
The Cochrane handbook 6 , the PRISMA statement 8 and the report by Howard et al. 9 provide appropriate information on how to carry out a systematic review and meta-analysis. In general, the steps provided in Table 1 are useful.
When studies with poor methods are included in the data set for meta-analysis, the ability of the meta-analyst to compute a robust mean effect size or to identify important controlling variables may be compromised. It is therefore crucial to define and report the criteria by which studies are assessed for inclusion. In addition, it is important to consider thoroughly whether the studies being considered for inclusion can reasonably be combined 11 . In other words, the following questions may be asked: (1) Do the studies consider a common outcome? (2) Were the outcomes measured in a similar way? (3) Were the effect sizes determined using the same outcome measure? These issues may be less important in some fields, such as ecology, than in others, such as pharmacy and medicine.
That effect sizes are independent is a very important statistical assumption in meta-analysis 12 . "Statistical independence implies that each effect size (or sample) represents an independent entity and the pooled effect size does not have a correlated structure. Non-independence is a major consideration in data set as it can affect (i) the calculation of effect size statistics and (ii) the estimations of overall meta-analytic estimates with their uncertainty which are two major interrelated components of a meta-analysis" 13 . Potential sources of non-independence include the extraction of multiple effect sizes from a single experiment or from different time points in a study, an effect being measured on each individual or simulated in a study, research on multiple species, and the influence of the research group. Non-independence can increase type I error rates in meta-analysis. In statistical hypothesis testing, a type I error occurs when a true null hypothesis is incorrectly rejected (false positive), while a type II error is the acceptance of a false null hypothesis (false negative).

Methods
An online search was conducted, primarily using Google and the PubMed database, to retrieve relevant articles on the different methods used to calculate effect sizes and the associated confidence intervals, effect size correlation, p values and $I^2$, as well as on how to evaluate heterogeneity and publication bias, based on the records available as of May 2017. The search terms used included 'effect size calculation', 'effect size and clinical trials', 'calculation of effect size in clinical trials', 'randomised clinical trials and effect size', 'effect size correlation', 'heterogeneity and effect size', 'publication bias', 'publication bias in clinical trials', 'fixed effect and effect size', and 'random effect and effect size'. Articles that were not in the English language or had no relevant information on effect size, heterogeneity, or publication bias were excluded, while the rest were evaluated to extract relevant information and, as appropriate, used to identify other relevant articles.
Data were independently extracted by the corresponding author and verified by another author.

Calculation of Effect Size in Meta-Analysis
Several ways have been used to calculate effect size, but the three most popular approaches are those of Glass, Hunter-Schmidt, and Cohen (Cohen's d) 14 . While these methods will not necessarily yield the same d values for the same set of data, applying one method consistently across all the studies included in a meta-analysis allows the effect sizes of the individual studies to be compared.
Table 1. Steps in carrying out a meta-analysis 8,9
Step 1: A thorough literature search for studies that address the hypothesis of interest, using defined keywords and search methods is performed. This will usually include searching for unpublished studies, for example by posting requests to professional manufacturers, newsletters or mailing lists. The research question can be formulated in terms of the problem/population, intervention, comparison, and outcome (PICO) 10 .
Step 2: The resulting studies are critically appraised and evaluated for possible inclusion in the review. Possible questions to be addressed for each article include: Is the publication applicable? Are the study methods appropriate? Is there enough information to calculate an effect size? (Record the reasons for dropping any studies from your data set).
Step 3: An appropriate measure of effect size is selected, and the effect size is calculated for each study retained.
Step 4: The selected studies are entered into a master database; information to be recorded should include study identity (author, and year), effect size(s), sample size(s) and information which codes each study for variables which may affect the outcome of each study, or whose possible influence on effect size needs to be investigated (experimental design, taxonomic information on the study species, geographic location of study population, life-history variables of the species used etc). How the effect size(s) is/are calculated for each study is also recorded.
Step 5: A summary of the cross-study support for the hypothesis of interest is done using meta-analytical methods. Also, any variation in conclusions drawn by individual studies is explained.
Step 6: The robustness and power of the analysis (likelihood of type I and type II errors) are determined.

In the fixed-effect model, when a study reports the mean and standard deviation (variance) of the treatment and control groups, Cohen's d can be used to calculate the standardised difference between the two means as follows:

$$ d = \frac{\mu_1 - \mu_2}{\sigma_{pooled}} \qquad (1) $$

$$ \sigma_{pooled} = \sqrt{\frac{\sigma_1^2 + \sigma_2^2}{2}} \qquad (2) $$

where d is the effect size, $\mu_1$ and $\mu_2$ represent the means of the effects of the treatment and control groups, $\sigma_1$ and $\sigma_2$ are the standard deviations of the means of the effects of the treatment and control groups, and $\sigma_{pooled}$ is the pooled standard deviation. However, in the random-effect model, the introduction of the between-studies variance ($\tau^2$) will change $\sigma$ to $V_R$, where

$$ V_R = \sqrt{\sigma^2 + \tau^2} $$

and $\tau^2 = \sigma_1^2 + \sigma_2^2$, based on the 'variance sum law' for independent variables.
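As a worked illustration, the following Python sketch computes Cohen's d from group means and standard deviations. It assumes the equal-group-size form of the pooled standard deviation, the square root of the mean of the two variances. The function name cohens_d and the input numbers are hypothetical.

```python
import math


def cohens_d(mean_t, mean_c, sd_t, sd_c):
    """Standardised mean difference between treatment and control.

    Assumption: the pooled SD is sqrt((sd_t^2 + sd_c^2) / 2), which is
    appropriate when the two groups are of similar size.
    """
    sd_pooled = math.sqrt((sd_t ** 2 + sd_c ** 2) / 2.0)
    return (mean_t - mean_c) / sd_pooled


# Invented example: treatment mean 24 (SD 5), control mean 20 (SD 3).
d_example = cohens_d(24.0, 20.0, 5.0, 3.0)
print(round(d_example, 3))
```

The treatment mean is placed first so that d is positive when the difference lies in the predicted direction.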
When a study reports a percentage of success after taking the treatment or no treatment (hit rate), an arcsine-based formula 15 can be used, in which p1 and p2 are the hit rates of the control and treatment groups, ordered according to the direction of the desired effect. The arcsine is the inverse of the sine, and the returned angle is given in radians in the range -π/2 to π/2. In Microsoft Excel, this value is calculated as =ASIN(p), where p is the proportion, which must be from -1 to 1.

When a study reports a between-subjects t statistic and its degrees of freedom (df), d can be calculated as

$$ d = \frac{2t}{\sqrt{df}} $$

When the studies list F statistics, d can be calculated from the mean square error, the group sizes and F, where MSE is the mean square error, n is the number of subjects in the treatment (t) or control (c) group, and F is the reported F statistic, usually given with the notation $F(df_c, df_s) = f_x$, where $df_c$ is the degrees of freedom based on the number of conditions, $df_s$ is the degrees of freedom based on the number of subjects and $f_x$ is the F value [e.g., F(1,39) = 3.12].

Effect-size correlation (r) is obtained from

$$ r = \frac{d}{\sqrt{d^2 + 4}} $$

The squared correlation, $r^2$, is the proportion of the variance in one variable 'accounted for' by the other; this is the proportion by which the variance of the outcome measure is reduced when it is replaced by the variance of the residuals obtained from a regression equation. When this is extended to multiple regression, it characterizes the proportion of the variance accounted for by all the independent variables, similar to ANOVA, where it is often called 'eta-squared', $\eta^2$. Thus, $r^2$ is often advocated as a universal measure of effect size. It is important to note that the means in the above equations are arranged in the direction of the effects. For example, if the desired effect is an increase, the control mean is subtracted from the treatment mean. Thus, d and r are positive if the mean difference is in the predicted direction.
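The conversions from test statistics can be sketched in Python as below. The t-based formula (d = 2t/sqrt(df)) and the d-to-r formula (r = d/sqrt(d^2 + 4)) are the widely used equal-group-size approximations and are assumed here. For F, the sketch does not use the MSE-based formula mentioned in the text; it swaps in the identity F = t^2 for an F statistic with one numerator degree of freedom, so it returns only the magnitude of d. All function names are hypothetical.

```python
import math


def d_from_t(t, df):
    """Cohen's d from a between-subjects t statistic (equal-n approximation)."""
    return 2.0 * t / math.sqrt(df)


def d_from_f(f_value, df_subjects):
    """Magnitude of d from F(1, df_subjects), using F = t^2.

    Assumption: exactly two conditions (one numerator degree of freedom).
    The sign of the effect must be taken from the reported group means.
    """
    return d_from_t(math.sqrt(f_value), df_subjects)


def r_from_d(d):
    """Effect-size correlation from d (equal-n approximation)."""
    return d / math.sqrt(d ** 2 + 4.0)


# Example using the F value quoted in the text, F(1, 39) = 3.12.
d_f = d_from_f(3.12, 39)
r_f = r_from_d(d_f)
print(round(d_f, 4), round(r_f, 4), round(r_f ** 2, 4))
```

Squaring the returned r gives the proportion of variance accounted for.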
If the effect size estimate from the sample is d, then it is normally distributed, with the following standard deviation:

$$ \sigma[d] = \sqrt{\frac{n_t + n_c}{n_t \, n_c} + \frac{d^2}{2(n_t + n_c)}} $$

where $n_t$ is the number in the experimental group and $n_c$ is that of the control group. Hence a 95% confidence interval for d would be from

$$ d - 1.96 \times \sigma[d] \quad \text{to} \quad d + 1.96 \times \sigma[d] \qquad (13) $$
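A short Python sketch of this interval follows. It assumes the commonly used large-sample approximation for the standard deviation of d, sqrt((n_t + n_c)/(n_t n_c) + d^2/(2(n_t + n_c))). The function name and the example values are hypothetical.

```python
import math


def d_confidence_interval(d, n_t, n_c, z=1.96):
    """Approximate standard deviation and confidence interval for d.

    Assumption: large-sample approximation for the standard deviation of d,
    with n_t and n_c the treatment and control group sizes.
    """
    sd = math.sqrt((n_t + n_c) / (n_t * n_c) + d ** 2 / (2.0 * (n_t + n_c)))
    return sd, d - z * sd, d + z * sd


# Invented example: d = 0.5 with 50 participants per group.
sd_d, lower_d, upper_d = d_confidence_interval(0.5, 50, 50)
print(round(sd_d, 4), round(lower_d, 4), round(upper_d, 4))
```

An interval that excludes zero corresponds to an effect that is significant at the 5% level under this approximation.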

Odds Ratio (OR) and Relative Risks (RR)
OR and RR are other possible indices of effect in group designs. An example of a report of a meta-analysis in which OR was used in the estimation is shown in Figure 1 5 . An example of a "forest plot" from a meta-analysis of different studies, using both fixed and random effect models, is illustrated in Figure 2. The odds ratio reflects the odds of a successful or desired outcome in the intervention group relative to the odds of a similar outcome in the control group. Consider the following 2x2 frequency table (Table 2).

Absolute risk reduction (ARR) = ARC - ART (16)

where ARC is the absolute risk (AR) of events in the control group and ART is the AR of events in the treatment group. The standard deviation (σ) of the OR or RR can be calculated using a formula in which M is the OR or RR.
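As an illustrative sketch, the Python function below derives the OR, RR and ARR from the four counts of a 2x2 table. Because Table 2 is not reproduced here, the argument names (events and totals in the treatment and control groups) are assumptions, as are the example counts. The formulas are the standard definitions: odds = events/non-events and risk = events/total.

```python
def two_by_two_effects(events_t, total_t, events_c, total_c):
    """Odds ratio, relative risk and absolute risk reduction from a 2x2 table.

    Hypothetical layout: events_t of total_t participants had the event in
    the treatment group, events_c of total_c in the control group.
    """
    risk_t = events_t / total_t                  # ART
    risk_c = events_c / total_c                  # ARC
    odds_t = events_t / (total_t - events_t)
    odds_c = events_c / (total_c - events_c)
    return {
        "OR": odds_t / odds_c,
        "RR": risk_t / risk_c,
        "ARR": risk_c - risk_t,
    }


# Invented example: 10/100 events on treatment versus 20/100 on control.
result = two_by_two_effects(10, 100, 20, 100)
print(result)
```

The function fails with a division by zero if any cell needed as a denominator is empty. A continuity correction is often applied in that case, which is not shown here.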

Heterogeneity and Publication Bias
Irrespective of the assumed quality of a meta-analysis, the reliability and strength of any inferences derived from it depend on the individual studies included. Thus, in reporting a meta-analysis, it is vital to state which studies were included. It is also essential to have some way of evaluating the likelihood that a significantly non-zero mean effect size is the result of a type I error, and the likelihood that a zero mean effect size reflects a lack of statistical power rather than the true mean effect size of the population. Both type I and type II error rates are affected by the number and identity of the included studies and their individual sample sizes 11 .

Heterogeneity
In meta-analysis, heterogeneity is the variability in outcomes across different studies. It is a consequence of clinical or methodological differences (or both) among the studies 6 . Statistical tests of heterogeneity are very popular in meta-analysis reports despite their well-known limitations. Cochran's Q is a classical measure of heterogeneity in different studies 16 :

$$ Q = \sum_{i=1}^{k} w_i \, (d_i - d_m)^2 $$

where $w_i = 1/(\sigma^2 + \tau^2)$ is the study weight, $d_i$ is the individual study effect size, $d_m$ is the mean effect size for the studies, k is the number of studies, $\sigma^2$ is the variance within studies ($\sigma[d_i]^2$) and $\tau^2$ (tau squared) is the variance between the studies ($\sigma[d_m]^2$). As indicated in the equation, Q is the weighted sum of squared differences between each study estimate and the pooled estimate, with the weights being those used in the pooling method. "It is distributed as a chi-square statistic with k -1 degrees of freedom (df) where k is the number of studies" 16 . For the fixed-effect model, $\tau^2 = 0$ (i.e., $w_i = 1/\sigma^2$), as it is assumed that there is no variability between the studies, unlike in the random-effect model. One commonly used statistic for quantifying inconsistency among studies is $I^2$:

$$ I^2 = \frac{Q - df}{Q} \times 100\% $$

where Q is the chi-squared statistic and df is its degrees of freedom 4,5 . This equation describes the percentage of the variability in effect estimates that is due to heterogeneity rather than sampling error (chance). If the $I^2$ estimate from the studies is y, the standard deviation (σ) of its distribution is calculated from $n_t$ and $n_c$, the total numbers of individuals in all the studies in the treatment and control groups, respectively. Hence a 95% confidence interval for $I^2$ would be from y - 1.96 × σ to y + 1.96 × σ.
The thresholds for the interpretation of $I^2$ can be misleading, as the importance of inconsistency depends on many factors. It should be noted that a low value of $I^2$ may reflect only trivial heterogeneity but does not rule out substantial heterogeneity. As a rough guide, 0% to 40% might not be important, 30% to 60% may represent moderate heterogeneity, 50% to 90% may represent substantial heterogeneity and 75% to 100% considerable heterogeneity. These cut-off points depend on the magnitude and direction of effects and on the strength of evidence for heterogeneity, such as the P value from the chi-squared test or a confidence interval for $I^2$.
Alternatively, $I^2$ values of 25%, 50%, and 75% can be assumed to correspond to small, moderate, and large heterogeneity.
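The calculation of Q and I^2 can be sketched in Python as follows. The sketch assumes fixed-effect weights (1/variance) by default, truncates negative I^2 values to zero as is conventional, and uses invented effect sizes and variances. The function names are hypothetical.

```python
def cochran_q(effects, variances, tau2=0.0):
    """Cochran's Q: weighted sum of squared deviations from the pooled mean.

    Weights are 1/(variance + tau2); tau2 = 0 gives fixed-effect weights.
    Returns Q and its degrees of freedom (k - 1).
    """
    weights = [1.0 / (v + tau2) for v in variances]
    d_m = sum(w * d for w, d in zip(weights, effects)) / sum(weights)
    q = sum(w * (d - d_m) ** 2 for w, d in zip(weights, effects))
    return q, len(effects) - 1


def i_squared(q, df):
    """I^2 as a percentage; negative values are set to zero."""
    if q <= 0:
        return 0.0
    return max(0.0, (q - df) / q) * 100.0


# Invented data for three studies.
q_value, q_df = cochran_q([0.10, 0.50, 0.90], [0.01, 0.04, 0.09])
i2_value = i_squared(q_value, q_df)
print(round(q_value, 4), q_df, round(i2_value, 2))
```

For these numbers, more than three quarters of the observed variability is attributed to heterogeneity rather than chance.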
In Microsoft Excel, the function to compute a p-value for Q is =CHIDIST(Q, df). Thus, if Q = 13.4626 and df = 1, p = CHIDIST(13.4626,1) = 0.0002. Usually, if p < 0.05, the difference is assumed to be 'significant'.
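Outside Excel, the same upper-tail chi-square probability can be obtained with the Python standard library alone. The sketch below is an assumption-laden illustration rather than a reference implementation: it handles integer degrees of freedom only, starting from the closed forms for df = 1 and df = 2 and applying the standard recurrence for the regularised upper incomplete gamma function. The function name chi2_upper_tail is hypothetical.

```python
import math


def chi2_upper_tail(x, df):
    """P(X >= x) for a chi-square variable with integer df >= 1.

    Starts from the closed forms for df = 1 (erfc) and df = 2 (exp), then
    uses Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1), z = x / 2.
    """
    if x <= 0:
        return 1.0
    z = x / 2.0
    if df % 2 == 0:
        a, tail = 1.0, math.exp(-z)
    else:
        a, tail = 0.5, math.erfc(math.sqrt(z))
    while a < df / 2.0 - 1e-9:
        tail += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return tail


# The example from the text: Q = 13.4626 with 1 degree of freedom.
p_q = chi2_upper_tail(13.4626, 1)
print(round(p_q, 4))
```

The familiar 5% critical values (about 3.841 for df = 1 and 5.991 for df = 2) return a probability close to 0.05, which is a convenient sanity check.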

Publication Bias
This often represents the greatest potential source of type I error (i.e., false positive) in meta-analysis. Over recent years, different names have been developed for biases related to publication bias. These include the selective exclusion of patients from the analysis 17 , outcome reporting bias 18 , time lag bias 19 , and location bias 20,21 . A funnel plot (Figure 3) "is a graphical tool commonly used for detecting bias in meta-analysis and systematic reviews. In this plot, treatment effect is plotted on the horizontal axis and the standard error on the vertical axis and the vertical line represents the summary estimate derived using fixed-effect meta-analysis. Two diagonal lines represent (pseudo) 95% confidence limits (effect ± 1.96 SE) around the summary effect for each standard error on the vertical axis. These show the expected distribution of studies in the absence of heterogeneity or of selection bias. In the absence of heterogeneity, 95% of the studies should lie within the funnel defined by these diagonal lines" 22,23 . Publication bias results in asymmetry of the funnel plot; smaller studies usually show the larger effects. However, a funnel plot may not always be a reliable tool, in particular when the number of studies included in the analysis is small.
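The pseudo 95% limits of the funnel can be checked numerically without drawing the plot. The Python sketch below flags studies whose effect falls outside summary ± 1.96 × SE at that study's own standard error. The summary effect, the study values and the function name are invented for illustration, and the check is not a formal test of asymmetry.

```python
def outside_funnel(effects, std_errors, summary, z=1.96):
    """Return True for each study lying outside the pseudo 95% funnel limits.

    The limits at a given standard error are summary - z*SE and summary + z*SE.
    """
    flags = []
    for effect, se in zip(effects, std_errors):
        lower = summary - z * se
        upper = summary + z * se
        flags.append(not (lower <= effect <= upper))
    return flags


# Invented example: summary effect 0.3 and three studies.
flags = outside_funnel([0.25, 0.90, 0.35], [0.10, 0.20, 0.30], summary=0.3)
print(flags)
```

Noticeably more than about 5% of studies outside the limits, or flagged studies clustered on one side among the small studies, would suggest heterogeneity or selection effects worth investigating.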

Power
In meta-analysis, a type II error occurs when a true effect goes unrecognised. This is often associated with meta-analyses of a small number of studies and is of great concern when compared to type I errors. Statistical power is the probability that the meta-analysis detects the expected effect, if the effect actually exists. If the mean effect size is approximately zero, if no significant heterogeneity is found among the studies, or if no variable is found to moderate the effect size, it becomes important to rule out a lack of statistical power. Power varies from one study to another, depending on the study's mean effect size difference (d) and the corresponding standard error (σ). Power is high in studies where d is large and σ is small; that is, studies are likely to detect effects when the effects are large and/or when the studies contain a large amount of information.
To derive the power of the individual studies that contribute to the meta-analysis, the within-study standard errors are not estimated prior to performing the meta-analysis. Instead, the normal within-study approximation, $Y_i \sim N(\mu_i, \sigma_i^2)$, is used ($\mu_i$ denotes the true effect in study i, $Y_i$ is the study's estimate of $\mu_i$ and $\sigma_i$ is the corresponding standard error). It is also assumed that two-tailed hypothesis tests are applied. The test statistic for $H_0: \mu_i = \mu_0$ versus $H_1: \mu_i \neq \mu_0$ in the i-th study is given by

$$ Z_i = \frac{Y_i - \mu_0}{\sigma_i} $$

For no effect, $\mu_0 = 0$. Under the null hypothesis ($H_0: \mu_i = \mu_0$), $Z_i \sim N(0, 1)$; under the alternative hypothesis, $Z_i \sim N(\delta_i/\sigma_i, 1)$, where $\delta_i = \mu_i - \mu_0$. Using a two-tailed test, the null hypothesis is rejected by the i-th study if $|Z_i| \geq Z_a$ and accepted if $|Z_i| < Z_a$, where $Z_a = 1.96$ is the critical value from a standard normal distribution corresponding to the conventional 5% significance level assumed to have been used in the analysis. The probability (p) of accepting the null hypothesis is therefore given by

$$ p = \Phi(Z_a - \delta_i/\sigma_i) - \Phi(-Z_a - \delta_i/\sigma_i) $$

where $\Phi(z)$ is the standard normal cumulative distribution function and z is the value at which the function is to be evaluated; at the 95% confidence level, z = 1.96 and $\Phi$ = 0.975. Thus, the power of a two-tailed test for a fixed-effect model is calculated as the probability of correctly rejecting a false null hypothesis, given by

$$ \text{Power} = 1 - \Phi(Z_a - \delta_i/\sigma_i) + \Phi(-Z_a - \delta_i/\sigma_i) $$
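A minimal Python sketch of this power calculation is given below, assuming the normal approximation described above: the probability of accepting the null hypothesis is Phi(Z_a - delta/sigma) - Phi(-Z_a - delta/sigma), and power is one minus that. The function names and the example values of delta and sigma are hypothetical.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function, Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def study_power(delta, sigma, z_a=1.96):
    """Power of a two-tailed z test for a true effect delta with standard error sigma."""
    shift = delta / sigma
    p_accept = normal_cdf(z_a - shift) - normal_cdf(-z_a - shift)
    return 1.0 - p_accept


# Invented example: true effect 0.56 estimated with standard error 0.2.
power_example = study_power(0.56, 0.2)
print(round(power_example, 3))
```

When delta is zero the function returns the significance level (about 0.05), and power rises towards 1 as delta/sigma grows.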

Conclusion
There is no doubt that meta-analysis provides very useful information for decision-making in the practice of pharmacy. Results from the integration of a small number of studies should be accepted with some caution, even if the p value indicates extreme statistical significance. So long as studies are well conducted, those involving several hundred events are more likely to be reliable and clinically useful. Overall, the use of individual patient data in meta-analysis may provide the best evidence of treatment effects in cohort studies and in clinically important subgroups. In reporting, it is important to provide the following: d, $r^2$, the mean of d and its 95% confidence interval (CI), Q, df, the p value for Q, $I^2$ and its 95% CI, as well as the power of the study.

Conflict of Interest
No conflict of interest associated with this article.