

Adolescent Gay/Lesbian Suicide Risk: Supplemental Information for the 2013 Paper: "Suicide Risk and Sexual Orientation: A Critical Review." Archives of Sexual Behavior, 42(5): 715-727. Online First (Feb. 26, 2013). Authors: Plöderl M, Wagenmakers EJ, Tremblay P, Ramsay R, Kralovec K, Fartacek C, Fartacek R. ResearchGate Full Text. Draft of Paper. Dr. Martin Plöderl Website. PDF Download (Alternate Link, Download Page): Basic Information Supplement Submitted for Peer Review. Most of the information is reproduced and expanded upon in this section.

A Critical Examination of the Shaffer et al. (1995) & Renaud et al. (2010) Psychological Autopsy Studies of Adolescent Suicides & An Expanded Homosexuality Factor in Adolescent Suicide
Webpages Available at the Website

Suicidality Studies Index: All Studies: The Index. All Random & Special Sample Studies. All American & Canadian Studies. All European Studies. Transgender Studies. The results of additional school-based North American Youth Risk Behavior Surveys or similar surveys (random sampling) are located at another location.

Other Pages: Homosexually Oriented People Are Generally at Greater Risk for the More Serious Suicidal Behaviors. "Attempting Suicide" as Related to Gender Nonconformity & Transgender Issues. Bell & Weinberg (1978) Homosexualities Study: "Attempted Suicide" Study Results.

Special Section: The 2013 paper, "Suicide Risk and Sexual Orientation: A Critical Review," reverses the conclusions of two previously published papers. The re-analysis, including many meta-analyses and using unconditional tests for statistical significance, indicates that gay/lesbian/bisexual adolescents are at risk for suicide. (This Page) In addition, expanding the "at risk" category to include adolescents known only to have been harassed or abused because they were assumed to be gay or lesbian produces more conclusive results, especially for males. This category represents "an expanded homosexuality factor in adolescent suicide." (This Page)

Associated Pages: Constructing "The Gay Youth Suicide Myth": Thirty Years of Resisting a Likely Truth & Generating Cohen's Effect Size "h" Via Arcsin/Arcsine Transformations.

The Related 2013 American Association of Suicidology Conference Card Handout.
Note: This webpage presents multiple meta-analytic combinations of results from two psychological autopsy studies of adolescent suicide deaths. Complications were generally related to the presence of "zero events" in one cell of each study, which required special considerations and analytical techniques available in the statistics literature. Some readers may be surprised that only two studies are in the meta-analyses, but the methodology used is the best available for combining two studies, and combining as few as two studies is said to be acceptable, especially if the second study is a replication of the first, as applies here (Valentine et al., 2010). Issues related to appropriate null hypothesis testing in "low count" / "rare events" studies are presented, and related problems are addressed.

Had the authors of the two studies used more appropriate null hypothesis tests, they might have concluded that sexual minority adolescents were likely more at risk for dying by suicide. Instead, both concluded that such a risk difference did not exist. After exploring these issues, it is shown, using more advanced statistical methods (Bayesian analyses, meta-analysis, and more powerful significance tests), that the higher suicide risk for sexual minority adolescents most likely exists.
Introduction: The Basics of the Shaffer et al. (1995) & Renaud et al. (2010) Studies and Related Issues: Null Hypothesis Testing Misunderstandings / Problems & Ignoring Potentially Important Observed Differences.
Problems With the Commonly Used Conditional Fisher Exact Test: The More Appropriate Use of Unconditional Tests and Mid-p Values.
Shaffer et al. (1995) & Renaud et al. (2010): Reported Study Data & Conditional Fisher Exact Test Results. Tabled Results of p-Value Unconditional Tests & Other Null Hypothesis Tests.
A Good Summary of Recommended Statistical Significance Tests for 2×2 Tables.
A Good Explanation of Why the Barnard Exact Test Should Be Used Instead of the Fisher Exact Test When Comparing Results of Two Small Independent Binomial Samples.
Barnard's Test for Two Independent Binomials & Mid-p Values.
Papers Related to Using Mid-p Values.
Roger Berger's Online p-Value Calculator: To Calculate the Fisher Exact Test as Used in Boschloo (1970) & the Z-pooled or Z-unpooled p-Values from Suissa & Shuster (1985).
Online Calculators to Generate Unconditional Null Hypothesis Test p-Values & Other Statistics.
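The mid-p values recommended in the resources above are simple to compute directly. The sketch below, a minimal illustration using only Python's standard library (the function names are illustrative, not taken from any of the cited calculators), reproduces the one-sided Fisher exact p-values reported for the two studies and the corresponding less conservative mid-p values:

```python
from math import comb

def fisher_one_sided(a, b, c, d):
    """One-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]:
    the hypergeometric probability, with all margins fixed, of a count in
    cell `a` at least as large as the one observed."""
    row1, col1, n = a + b, a + c, a + b + c + d
    return sum(comb(col1, x) * comb(n - col1, row1 - x)
               for x in range(a, min(row1, col1) + 1)) / comb(n, row1)

def mid_p_one_sided(a, b, c, d):
    """Mid-p value: the Fisher tail minus half the probability of the
    observed table itself, a standard way to reduce the test's conservatism."""
    row1, col1, n = a + b, a + c, a + b + c + d
    p_obs = comb(col1, a) * comb(n - col1, row1 - a) / comb(n, row1)
    return fisher_one_sided(a, b, c, d) - 0.5 * p_obs

# Shaffer et al. (1995): 3/120 sexual minority suicides vs. 0/147 controls.
print(fisher_one_sided(3, 0, 117, 147))  # ~0.090, as reported
print(mid_p_one_sided(3, 0, 117, 147))   # ~0.045, below the 0.05 threshold

# Renaud et al. (2010): 4/55 vs. 0/55.
print(fisher_one_sided(4, 0, 51, 55))    # ~0.059
print(mid_p_one_sided(4, 0, 51, 55))     # ~0.030
```

Note that the mid-p variants of both tests fall below the conventional 0.05 threshold even though the standard conditional Fisher p-values do not, which is the conservatism problem this section discusses.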
The Shaffer et al. (1995) & Renaud et al. (2010) psychological autopsy studies of adolescent suicides only explored the suicide results for adolescents deemed to be homosexually oriented. The evidence suggests, however, that adolescents not known to be homosexually oriented, but targeted for harassment based on other adolescents' beliefs that they were gay or lesbian (related information was solicited in both studies), would likely also be at greater risk for dying by suicide. Hence, five new analyses are carried out with the data from the two studies to explore the suicide risk of adolescents who were homosexually oriented combined with those targeted for anti-gay harassment but not known to be homosexually oriented. Using the more appropriate null hypothesis testing methods applied in the first analysis (statistical significance tests), along with more advanced statistical methods, namely Bayesian analyses & meta-analysis (Table 1a), arcsine difference meta-analysis (Table 1b), odds ratio meta-analysis using two continuity correction methods (Table 1c, Table 1d), and the Peto method (Table 1e), deemed the best available for studies with low counts and possible zero events, it is shown that these sexual minority adolescents are most likely at greater risk for suicide. The five analyses are also carried out only for the males in both studies, in which homosexually oriented males, combined with males reported to have been targeted for anti-gay harassment, accounted for all or almost all of the suicide deaths in the homosexuality-related category. Results for homosexuality-related males are generally more indicative of greater suicide risk than those for homosexuality-related males and females analyzed together (Tables 2a, 2b, 2c, 2d, 2e).
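The arcsine difference effect size mentioned above (Cohen's h) has a simple closed form: h = 2·arcsin(√p1) − 2·arcsin(√p2). The sketch below applies it to the raw proportions observed in the two studies; these are illustrative calculations, not the paper's Table 1b meta-analytic results:

```python
from math import asin, sqrt

def cohens_h(p1, p2):
    """Cohen's effect size h: the difference of arcsine-transformed
    proportions, which stabilizes the variance of proportions near 0 or 1."""
    return 2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2))

# Proportions of sexual minority adolescents in each sample.
h_shaffer = cohens_h(3 / 120, 0 / 147)  # 2.5% vs. 0.0%
h_renaud = cohens_h(4 / 55, 0 / 55)     # 7.3% vs. 0.0%
print(round(h_shaffer, 3))  # ~0.318
print(round(h_renaud, 3))   # ~0.546
```

By Cohen's own benchmarks (0.2 small, 0.5 medium, 0.8 large), these raw-proportion values would be small-to-medium and medium effects, respectively.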
"Presenting results that 'support' a null hypothesis requires more detailed statistical reporting than do results that reject the null hypothesis. Additionally, a change in thinking is required. Null hypothesis significance testing does not allow for conclusions about the likelihood that the null hypothesis is true, only whether it is unlikely that the null is true."

For peer reviewers of scientific papers, Elsevier Publishers makes available the same advice via a document produced by Tony Brady (2005/2008):

"Authors with 'negative' results (i.e. found no difference) should not report equivalence unless sufficiently proven: 'absence of evidence is not evidence of absence.'"

WikiAnswers effectively replied to the question "Does a hypothesis test ever prove the null hypothesis?":

"I could be mistaken on this, but research is not usually, if ever, designed to 'prove' the null hypothesis. The idea, hope or expectation is that your design will give you sufficient reason to 'reject' the null hypothesis. Under virtually all circumstances, it would be shabby work to take a study designed to do one thing and automatically conclude that undesired results 'prove' some alternative, including the null hypothesis. The best you can do is to conclude that the null hypothesis cannot be rejected, which is a far cry from proving it. Another take on this is that the null hypothesis is really an abstraction, and not of any practical use in and of itself."

Schlag (2011) offers important null hypothesis testing information, with related warnings:

2.5 Accepting the Null Hypothesis: Null hypotheses are rejected or not rejected. One does not say that the null hypothesis is accepted, as is apparent in the conclusions made in the Shaffer & Renaud studies. Why? Not being able to reject the null hypothesis can have many reasons. It could be that the null hypothesis is true. It could be that the alternative hypothesis is true but that the test was not able to discover this and instead recommended not rejecting the null [this will be shown to apply in the Shaffer & Renaud studies]. The inability of the test to discover the truth can be due to the fact that it was not sufficiently powerful, and that other tests are more powerful [these tests are given later, with results generated for the Shaffer & Renaud studies]. It could be that the sample size is not big enough, so that no test can be sufficiently powerful [this applies in the Shaffer & Renaud studies].

Gelman & Stern (2006) emphasize an important danger related to statistical testing:

"As well, introductory courses regularly warn students about the perils of strict adherence to a particular threshold such as the 5% significance level. Similarly, most statisticians and many practitioners are familiar with the notion that automatic use of a binary significant/nonsignificant decision rule encourages practitioners to ignore potentially important observed differences."
The Shaffer and Renaud study data produced one-sided Fisher exact test p-values of 0.09 and 0.06, respectively, meaning that the criterion for not rejecting the null hypothesis (p >= 0.05) was met, the p-value being defined as:
 "The probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis is true." (Sestini & Rossi, 2009)
 "[T]he probability of obtaining the difference observed, or one that is more extreme [Emphasis mine, see related excerpt, Hubbard & Lindsay, 2008], considering the null is true." (Biau et al., 2010)
 A p < 0.05 would mean that the Null Hypothesis is rejected and this also "means there is only a small chance that the obtained result would have occurred if H_{0} [the Null Hypothesis] were in fact true" (Aberson, 2002a: 37).
Although p >= 0.05 is the criterion for not rejecting the null hypothesis, such p-values do not mean that the null hypothesis is true! Why? Because we began with the assumption "the null hypothesis is true." Given that all calculations related to a study are then based on the null hypothesis being true, the best that can happen is to produce evidence that the null hypothesis may be true, expressed as p-values >= 0.05 (5%), meaning it should not be rejected.

Note that not rejecting the null hypothesis, and therefore assuming that the differences in a study are not statistically significant, could be an error: a Type II error that has long been reported to be common, also called "conservatism," with the Fisher exact test (Hirji et al., 1991; Mehrotra et al., 2003; Hasselblad & Lokhnygina, 2007).

A p > 0.05 value means: "the probability is greater than 1 in 20 [5 in one hundred, 5%] that a difference this large or larger [emphasis mine, see related excerpt, Hubbard & Lindsay, 2008] could occur by chance alone" (Goodman, 2008: 137), when assuming that the null hypothesis is true. Had the result been < 5% (0.05, the chosen α value), we could have said that the study result was unlikely to be due to chance alone, given the assumed null hypothesis. In other words, the probability that "only the chance factor" applies would have been so low (less than 1 in 20; < 5%, p < 0.05) that the following conclusion would ensue: there is some inherent or real difference between the groups in the study, and the assumed null hypothesis is to be rejected. Given the demanding < 5% criterion for rejecting the null hypothesis (especially demanding for low count samples), the null hypothesis was not rejected in either study. As discussed above, the p > 0.05 results in both studies also do not mean that the null hypothesis is true, nor does such testing "allow for conclusions about the likelihood that the null hypothesis is true" (Aberson, 2002a: 36).

Nonetheless, the probability that the results would have occurred by "chance alone" if the null hypothesis were in fact true was still low in both studies: 9/100 (0.09, 9%: one- & two-sided) and 6/100 (0.06, 6%: one-sided) or 11.8/100 (0.118, 11.8%: two-sided). Even though the null hypothesis was not rejected, it remained possible that the suicide groups and controls were inherently different (also meaning a possible Type II error: not rejecting the null hypothesis when it should have been rejected, a common problem with the Fisher exact test: Cashen & Geiger, 2004). The differences, even if declared statistically non-significant, may therefore have required a more critical evaluation. For example, the Renaud study's 6% result (p = 0.06) was perhaps not simply a reflection of the magnitude of difference, but at least partly a consequence of p-values being more likely to exceed 0.05, in error, when study samples are small, events are rare, and differences are small to moderate, or even large (Mehrotra et al., 2003). This issue will be discussed later.

For the Renaud study Ns (4/55 vs. 0/55), it was also possible that the sample was large enough, but that using the Fisher exact test was inappropriate. That is, more appropriate, and more powerful, null hypothesis tests may be available for studies with low counts and zero events. This will later be shown to apply in both the Shaffer and Renaud studies.

For more information about null hypothesis testing, p-values and related issues, see: Cook (2010), Goodman (1999), Hubbard & Armstrong (2006), Hubbard & Bayarri (2003, 2003a), Moran (2006), Moran & Solomon (2004), Panagiotakos (2008), Sellke et al. (2001), Senn (2001), Stang et al. (2010), Sterne (2002), and Verhagen et al. (2004).
One problem in the two studies is that we have the minimum "sexual minority" counts needed in the two suicide groups to begin doing more revealing statistical work, but only if we had similar counts in the matched control groups, where there are "0" counts. We know, however, that, given larger control groups, "sexual minority" counts would eventually appear. These counts are determined by researcher-selected informers for both groups, such as a parent or close friend reporting that an adolescent in the living control group is or was a non-heterosexual / homosexually oriented adolescent, just as was done for those who died by suicide. If the counts were low, appropriate statistical methods (tests of statistical significance) could be used to reject the null, "no difference," hypothesis and show that a greater risk does exist.

With more counts in both categories, however, and especially in categories that might otherwise have "0" counts, it would be possible to answer the most important question: the magnitude of the greater risk. This is measured by odds ratios (ORs), which are a direct measure of effect size and represent how much greater the risk of dying by suicide would be for sexual minority adolescents. For example, we might be able to state that the odds of a sexual minority adolescent dying by suicide are 6 times greater than for a heterosexual adolescent, with 95% confidence that identically repeated studies would produce OR values from, say, 2 to 12 (the confidence interval). The above exploration of counts in both studies loosely suggests that, had the researchers increased the size of the control samples, ORs of about 7 or 8 might have been produced, with statistically significant or near-significant results, and that the absolute minimum OR would have been about 4, this lowest OR being unlikely.
"In 1995, I was anxiously awaiting the results from researchers I had paid to investigate whether sexual minority adolescents are at greater risk for dying by suicide. When they finally reported their results, I was told that there were only 3 'sexual minority' deaths in the suicide group, and no 'sexual minority' individuals in the control group (n = 147). They also informed me, on the basis of the one-sided Fisher exact test (p = 0.09), that even if it looked like there were differences between the two groups, the non-significant result of the test meant that there were no significant differences between the groups and that sexual minority adolescents are NOT at greater risk for dying by suicide. This result troubled me for years because, from somewhere within me, I felt that there were differences: sexual minority adolescents in the suicide group and none in the control group.

It is now important to note that, to achieve an appropriately powered study sample, similarly conducted psychological autopsy studies will likely need a minimum of about 100 adolescents in the suicide group, with control group sizes selected so that there is an absolute minimum of about one count for sexual minority adolescents; a count of 3 might be a preferred minimum given estimate uncertainties when events range from 0 to 2. Meanwhile, we are left with the possibility that such new studies might produce odds ratios (ORs) from 5 to 8. This is suggested by 'playing,' as done above, with what the sexual minority counts might have been had the control groups in both studies been larger (producing ORs from about 5 to 8), or by adding a small amount (a continuity correction, such as 0.5) to each cell when a zero in one cell makes the calculation of an odds ratio (OR) impossible. The latter procedure produces ORs of 8 to 10 (but with wide confidence intervals), as given below in the Mantel-Haenszel meta-analysis.

About five years later (2000), I asked the same researchers to do a similar study because I wanted to know if sexual minority adolescents might have since become at greater risk for dying by suicide. When they reported their results to me in 2010, I simply could NOT believe what they had done. It was obvious to me that there had been a low count problem with the first study, but this time they had compared samples that were even smaller: 55 vs. 120 adolescent suicides, and 55 vs. 147 for the control groups, respectively. Fortunately, there was a greater percentage of 'sexual minority' individuals in the suicide group (4/55 = 7.3%, compared to 3/120 = 2.5% in the first study), which again made a statistical analysis possible, but there were still '0' sexual minority adolescents in the control group (n = 55), which certainly could have been expected given that the previous study had produced the same result (0 counts) in a sample about 3 times the size: 147 adolescents. Again, I was told what had been said about the first study: 'as the result of the Fisher exact test (p = 0.06), even if it looks like there are differences between the two groups, and even if the result is very close to statistical significance, the non-significant result of the test means that there are no significant differences and that sexual minority adolescents are NOT at greater risk for dying by suicide.'

I felt uneasy about the reported research outcomes because there were always sexual minority adolescents in the suicide groups and none in the control groups. I even began to wonder if, for some reason, these researchers might have a biased agenda in maybe never wanting to report that 'sexual minorities' are at greater risk for suicide. Yet it seemed to me that they were at greater suicide risk given the numbers produced in the two studies. I therefore decided to get second opinions and also to educate myself about statistical testing. I finally paid researchers well versed in sexual minority suicidology and statistics to re-evaluate the two studies. The great surprise was that the researchers who did the original two studies had been misusing statistical testing: it was inappropriate for them to declare that a non-significant statistical test result also meant that there were actually no differences between the two groups. Even worse, they did not mention anything about the many reports, common by the 1980s but also existing before then, that the Fisher exact test often produces very conservative estimates when used with small independent binomial samples. This means that other, more appropriate statistical significance testing procedures might show that statistical significance exists for these studies: that the null hypothesis should be rejected. By using improved statistical methodologies, they showed that, in spite of the low counts in both studies, the new statistical significance results indicated that sexual minority adolescents were at greater risk for suicide."
The Shaffer et al. (1995) & Renaud et al. (2010) Studies: Counts in the Study Sample

Categories                        Shaffer et al. (1995)             Renaud et al. (2010)
                                  Suicide Sample   Control Sample   Suicide Sample   Control Sample
Adolescent Sample Sizes (N)       120              147              55               55
Sexual Minority Individuals (n)   3                0                4                0
% Sexual Minority                 2.5%             0.0%             7.3%             0.0%

Basic Observation: There are always sexual minority adolescents in the suicide group, and always none in the control group.
The Shaffer et al. (1995) & Renaud et al. (2010) Studies: Counts With Continuity Correction* & Related Meta-Analysis

Categories            Shaffer et al.                          Renaud et al.
                      Suicides           Controls             Suicides          Controls
Homosexual: Yes       3 + 0.5 = 3.5      0 + 0.5 = 0.5        4 + 0.5 = 4.5     0 + 0.5 = 0.5
Homosexual: No        117 + 0.5 = 117.5  147 + 0.5 = 147.5    51 + 0.5 = 51.5   55 + 0.5 = 55.5
Odds Ratio (95% CI)   8.79 (0.45 - 171.81) ^{1}               9.70 (0.51 - 184.62) ^{1}
                      9.02 (0.46 - 176.34) ^{2}               9.70 (0.51 - 184.62) ^{2}

Two Studies Combined Via Mantel-Haenszel Meta-Analysis:
OR: 9.37 (1.15 - 76.03), fixed effect. ^{2}
OR: 9.36 (1.15 - 75.85), random effects. ^{2}

1. OR calculations (Taylor series): DB Wilson. OpenEpi.
2. ORs & meta-analyses carried out with the R program: Schwarzer (2012). Reference: Fixed / Random Effect(s) Models.

* Mantel-Haenszel odds ratio method: adding 0.5 to each cell when one cell contains a zero. This is not a good method to use with rare events study results according to The Cochrane Collaboration (2011), but only one part of the 2x2s is in the rare event category (0.0%). Nonetheless, the OR meta-analysis result, 9.37, is close to the OR range (5 to 8) estimated above and is therefore reasonable.
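The continuity-corrected odds ratios and their Taylor-series confidence intervals can be reproduced in a few lines. The sketch below is a minimal plain-Python illustration (function names are this sketch's own); the Mantel-Haenszel pooled point estimate it computes lands near, not exactly on, the reported 9.37, since packaged implementations differ in how they handle the corrections:

```python
from math import log, exp, sqrt

def or_with_ci(a, b, c, d, cc=0.5, z=1.96):
    """Odds ratio for the 2x2 table [[a, b], [c, d]] with a continuity
    correction added to every cell, plus the Taylor-series (log-OR)
    95% confidence interval. Returns (OR, lower, upper)."""
    a, b, c, d = a + cc, b + cc, c + cc, d + cc
    or_ = (a * d) / (b * c)
    se = sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # SE of ln(OR)
    return or_, exp(log(or_) - z * se), exp(log(or_) + z * se)

print(or_with_ci(3, 0, 117, 147))  # ~ (8.79, 0.45, 171.8), as tabled
print(or_with_ci(4, 0, 51, 55))    # ~ (9.70, 0.51, 184.6)

def mantel_haenszel_or(tables, cc=0.5):
    """Mantel-Haenszel pooled OR over corrected 2x2 tables (point estimate only)."""
    num = den = 0.0
    for a, b, c, d in tables:
        a, b, c, d = a + cc, b + cc, c + cc, d + cc
        n = a + b + c + d
        num += a * d / n
        den += b * c / n
    return num / den

print(mantel_haenszel_or([(3, 0, 117, 147), (4, 0, 51, 55)]))  # ~9.25
```

The pooled estimate of roughly 9.25 is consistent with the table's 9.37; the small gap reflects implementation details of the correction, not a different method.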
The Shaffer et al. (1995) & Renaud et al. (2010) Studies: Peto Method Odds Ratios* & Related Meta-Analysis

Categories                 Shaffer et al.           Renaud et al.
                           Suicides   Controls      Suicides   Controls
Homosexual: Yes            3          0             4          0
Homosexual: No             117        147           51         55
Peto Odds Ratio (95% CI)   9.41 (0.96 - 92.33) ^{1}    7.82 (1.07 - 57.06) ^{1}
                           9.72 (0.99 - 95.62) ^{2}    7.82 (1.07 - 57.06) ^{2}

Two Studies Combined Via Meta-Analysis:
8.47 (1.89 - 37.92) ^{1}
OR: 8.59 (1.92 - 38.48), fixed effect. ^{2}
OR: 8.59 (1.92 - 38.48), random effects. ^{2}
* Peto Odds Ratio Method: Bradburn et al. (2007) describe how the odds ratios are estimated using the Peto method: "The Peto one-step method [16: Yusuf et al., 1985] computes an approximation of the log-odds from the ratio of the efficient score to the Fisher information, both evaluated under the null hypothesis. These quantities are estimated, respectively, by the sum of the differences between the observed and expected numbers of events in the treatment arm and by the sum of the conditional hypergeometric variances." (p. 55) The method works well with low incidences (less than 1%), including the zero events (n = 0) present in both the Shaffer and Renaud studies. However, it produces increasingly biased results (OR underestimates) when incidences are greater than 1%. For the Shaffer study, the underestimate would apply given the 2.5% incidence (3/120), with the same applying for the Renaud study: 7.3% (4/55).
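The "observed minus expected" description above translates into very little code. The sketch below (the function name is illustrative) reproduces the Peto odds ratios and confidence intervals tabled above (the footnote-1 values) and the fixed-effect combination:

```python
from math import exp, sqrt

def peto_or(a, b, c, d, z=1.96):
    """Peto one-step odds ratio for the 2x2 table [[a, b], [c, d]]:
    ln(OR) = (O - E) / V, where O is the observed count in cell `a`,
    E its expectation under the null, and V the hypergeometric variance.
    No zero-cell correction is needed. Returns (OR, lower, upper)."""
    n1, n2 = a + c, b + d          # group sizes (suicides, controls)
    m1, m2 = a + b, c + d          # totals of events and non-events
    n = n1 + n2
    e = m1 * n1 / n                # expected events in the first group
    v = m1 * m2 * n1 * n2 / (n ** 2 * (n - 1))
    ln_or = (a - e) / v
    half = z / sqrt(v)             # half-width of the CI on the log scale
    return exp(ln_or), exp(ln_or - half), exp(ln_or + half)

print(peto_or(3, 0, 117, 147))  # ~ (9.41, 0.96, 92.3), as tabled
print(peto_or(4, 0, 51, 55))    # ~ (7.82, 1.07, 57.1)

# Fixed-effect combination: sum the (O - E) and V terms across studies.
oe = (3 - 3 * 120 / 267) + (4 - 4 * 55 / 110)
v = (3 * 264 * 120 * 147 / (267**2 * 266)) + (4 * 106 * 55 * 55 / (110**2 * 109))
print(exp(oe / v))  # ~8.47, matching the combined estimate above
```

Because the method needs no continuity correction, the zero cells in both control groups pose no problem, which is exactly why the Cochrane material quoted below recommends it for rare events.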
Abstract: One of the most frequently cited reasons for conducting a meta-analysis is the increase in statistical power that it affords a reviewer. This article demonstrates that fixed-effects meta-analysis increases statistical power by reducing the standard error of the weighted average effect size (T̄.) and, in so doing, shrinks the confidence interval around T̄.. Small confidence intervals make it more likely for reviewers to detect nonzero population effects, thereby increasing statistical power. Smaller confidence intervals also represent increased precision of the estimated population effect size. Computational examples are provided for 3 effect-size indices: d (standardized mean difference), Pearson's r, and odds ratios. Random-effects meta-analyses also may show increased statistical power and a smaller standard error of the weighted average effect size. However, the authors demonstrate that increasing the number of studies in a random-effects meta-analysis does not always increase statistical power.

"Although several articles have addressed issues related to meta-analysis and statistical power (e.g., Hedges & Pigott, 2001; Hunter & Schmidt, 1990; Strube, 1985; Strube & Miller, 1986), no article has explained how meta-analysis increases the statistical power of tests of overall treatment effects and relationships. This article addresses this gap in explanation." (p. 246)

For meta-analysis criticisms and related evaluations, see Borenstein et al. (2009). Note that the criticisms of meta-analysis would not apply to the meta-analysis results reported on this webpage. The two studies being combined via meta-analysis used almost identical methodologies, and the purpose of the meta-analyses is to report only the results for the two studies combined. The meta-analytic combinations of the two studies also always resulted in statistically significant differences.
From Section "16.9.5 Validity of methods of meta-analysis for rare events" (PDF Download):

"Bradburn et al. found that many of the most commonly used meta-analytical methods were biased when events were rare (Bradburn 2007). The bias was greatest in inverse variance and DerSimonian and Laird odds ratio and risk difference methods, and the Mantel-Haenszel odds ratio method using a 0.5 zero-cell correction. As already noted, risk difference meta-analytical methods tended to show conservative confidence interval coverage and low statistical power when risks of events were low.

At event rates below 1% the Peto one-step odds ratio method was found to be the least biased and most powerful method, and provided the best confidence interval coverage, provided there was no substantial imbalance between treatment and control group sizes within studies, and treatment effects were not exceptionally large. This finding was consistently observed across three different meta-analytical scenarios, and was also observed by Sweeting et al. (Sweeting 2004)...

Methods that should be avoided with rare events are the inverse-variance methods (including the DerSimonian and Laird random-effects method). These directly incorporate the study's variance in the estimation of its contribution to the meta-analysis, but these are usually based on a large-sample variance approximation, which was not intended for use with rare events. The DerSimonian and Laird method is the only random-effects method commonly available in meta-analytic software. We would suggest that incorporation of heterogeneity into an estimate of a treatment effect should be a secondary consideration when attempting to produce estimates of effects from sparse data – the primary concern is to discern whether there is any signal of an effect in the data."
From Section "9.4.4.2 Peto odds ratio method" (Full Text):

Peto's method (Yusuf 1985) can only be used to pool odds ratios. It uses an inverse variance approach but utilizes an approximate method of estimating the log odds ratio, and uses different weights. An alternative way of viewing the Peto method is as a sum of 'O – E' statistics. Here, O is the observed number of events and E is an expected number of events in the experimental intervention group of each study. The approximation used in the computation of the log odds ratio works well when intervention effects are small (odds ratios are close to one), events are not particularly common and the studies have similar numbers in experimental and control groups. In other situations it has been shown to give biased answers. As these criteria are not always fulfilled, Peto's method is not recommended as a default approach for meta-analysis. Corrections for zero cell counts are not necessary when using Peto's method. Perhaps for this reason, this method performs well when events are very rare (Bradburn 2007) (see Chapter 16, Section 16.9).

Cochrane Collaboration's Open Learning Material for Cochrane Reviewers, From Section "Combining studies: Weighted Averages" (PDF Download):

The Peto method: The Peto method works for odds ratios only. Focus is placed on the observed number of events in the experimental intervention. We call this O for 'observed' number of events, and compare this with E, the 'expected' number of events. Hence an alternative name for this method is the 'O – E' method. The expected number is calculated using the overall event rate in both the experimental and control groups. Because of the way the Peto method calculates odds ratios, it is appropriate when trials have roughly equal numbers of participants in each group and treatment effects are small. Indeed, it was developed for use in mega-trials in cancer and heart disease where small effects are likely, yet very important. The Peto method is better than the other approaches at estimating odds ratios when there are lots of trials with no events in one or both arms. It is the best method to use with rare outcomes of this type. The Peto method is generally less useful in Cochrane reviews, where trials are often small and some treatment effects may be large.
Excerpt: "The p value computes not the probability of the observed data under H0, but this plus the probability of more extreme data. This is a major weakness regarding the usefulness of p values. Because they are defined as a procedure for establishing the probability of an outcome, as well as more extreme ones, on a null hypothesis, significance tests are affected by how the probability distribution is spread over unobserved outcomes in the sample space. That is, the p value denotes not only the probability of what was observed, but also the probabilities of all the more extreme events that did not arise. How is it that these more extreme, unobserved, cases are involved in calculating the p value?" (p. 78)

Moran JL (2006). Statistical Issues in the Analysis of Outcomes in Critical Care Medicine. Dissertation for the Degree of Doctor of Medicine, Department of Intensive Care Medicine, University of Adelaide, Australia. Abstract & Download Page. PDF Download.

Moran JL, Solomon PJ (2004). Point of View: A Farewell to P-values? Critical Care and Resuscitation, 6: 130-137. PDF Download. Read online.
"Statisticians are not to blame for the misconceptions in psychology about the use of statistical methods. They have warned us about the use of the hypothesis-testing models and the related concepts. In particular they have criticized the null hypothesis model and have recommended alternative procedures similar to those recommended here (See Savage, 1957; Tukey, 1954; and Yates, 1951)." (Nunnally, 1960: 649)

Nunnally, Jum (1960). The Place of Statistics in Psychology. Educational and Psychological Measurement, 20(4): 641-650. Abstract.

50 Years Later:

"Recommendation 2.17: Researchers must continue to move beyond a sole reliance on statistical significance in interpreting quantitative research in suicidology to address issues of the clinical and practical usefulness of their results." (Rogers & Lester, 2010: Chapter 2: General Methodological Issues, p. 22)

Rogers JR, Lester D (2010). Understanding Suicide: Why We Don't and How We Might. Cambridge, MA: Hogrefe Publishing. Hogrefe Publishing. Amazon. Book Review.
"The power of a statistical test is the probability that the test will reject the null hypothesis when the null hypothesis is actually false (i.e., the probability of not committing a Type II error, or making a false negative decision [as will be shown to apply in both the Shaffer et al. (1995) and Renaud et al. (2010) studies]). The power is in general a function of the possible distributions, often determined by a parameter, under the alternative hypothesis. As the power increases, the chances of a Type II error occurring decrease. The probability of a Type II error occurring is referred to as the false negative rate (β). Therefore power is equal to 1 − β, which is also known as the sensitivity. Power analysis can be used to calculate the minimum sample size required so that one can be reasonably likely to detect an effect of a given size. Power analysis can also be used to calculate the minimum effect size that is likely to be detected in a study using a given sample size. In addition, the concept of power is used to make comparisons between different statistical testing procedures: for example, between a parametric and a nonparametric test of the same hypothesis."

Given that power analyses were generally lacking in published medical research papers that required such analyses, Freedman et al. (2001) emphasized what was required by ending their paper with:

"A proper design of the study and appropriate statistical analysis are essential to the validity of all quantitative clinical research. A type-I error is better known than a type-II error, and reviewers and readers are more cognisant of p values when authors conclude that significant differences between groups are found. Equal scrutiny is required when authors decide that there is no statistically significant difference. In this age of limited resources and tight budgets, physicians may be forced to employ the cheapest methods, especially if the choices are thought to be similar. It is therefore important that investigators do not erroneously label two treatments as equivalent when it has merely been shown that the differences were not statistically significant. All clinical studies should be based on appropriate calculations of sample size. The awareness of type-I error and the popularity of p values should be matched by equal cognisance of type-II error and β values. The practice of evidence-based medicine requires no less." (p. 401)

Unfortunately, power analyses have been generally ignored by researchers in spite of repeated recommendations in many fields of study (Sedlmeier & Gigerenzer, 1989; Freedman et al., 2001; Jennions & Møller, 2003; Cashen & Geiger, 2004; Balkin & Sheperis, 2011; Lau & Kuk, 2011), essentially meaning that "ignorance" of a basic concept in statistics reigns supreme in the research world, for both researchers and the numerous peer reviewers evaluating papers submitted for publication. This level of ignorance especially applies with respect to null hypothesis testing methods (statistical significance testing), and especially with low-count samples and samples with rare events, including zero events in 2x2 cells. This issue will be addressed later on this page.
"For researchers, knowing the optimal number of subjects to recruit to a study will provide more certainty in the conclusion without wasting unnecessary resources (such as time and money). For clinicians who review the research, knowing that the study was performed with an adequate number of subjects could instill additional confidence in the reported findings. Thus, it is important that both researchers and consumers of the research understand whether the study was conducted with an adequate sample of subjects. One way to choose an appropriate sample is to use power analysis." (p. 30)

"Power analysis can be done after the data has been collected in order to report the chance (P [= Power]) of detecting a true effect. This is known as a post-hoc power analysis. The American Psychological Association recommends that researchers include a post-hoc power analysis as a good practice when nonsignificant results are reported in the paper. This information can be used to improve the design of future independent studies that replicate the nonsignificant study. Unfortunately, it is uncommon to find post-hoc power analyses in publications." (p. 30)
"Another useful application of power analysis is to determine the minimal number of subjects in a research design prior to data collection. This is known as an a priori power analysis. An increasing number of journal reviewers are asking for a priori power analyses to justify the sample size in a submitted paper, especially for those studies with a relatively small pool of subjects.

Post-hoc power analysis to determine the impact of treatment: When applying power analysis, four essential, interrelated parameters are required: sample size (N), effect size (ES) [greatly ignored, see: Cohen (1992) & Alhija (2009)], significance criterion (α), and power (P). The sample size N denotes the number of subjects who take part in the experiment. The effect size is also known as the sensitivity of the test. ES can be conventionally expressed as different indices (with different notations) according to the statistical tests being employed. For example, the effect size index is d for a t-test with two independent groups...

The calculation of the p-value can be referenced to any standard textbook on statistics. The significance criterion (α) in power analysis is the chance level for an error the researcher is willing to accept when the test shows a significant treatment effect [significant group difference], i.e., p < α. When α is set to 0.05, it means there is a 5% chance of making an error of believing that there is a true effect in the population when there is none.

In a post-hoc power analysis, the parameters N, ES, and α are typically known, and the calculation of P is of the most interest to the researchers and its intended audience. If P is found to be greater than 0.8 [80%], it is assumed that the treatment effect is considered practically significant and has an impact in the real world. If P is less than 0.8 [80%], the value of ES could help in estimating how many subjects will be needed to yield a P greater than 0.8. The post-hoc power analysis becomes an a priori power analysis." (pp. 30-31)

Aberson et al. (2002) summarized the statistical power misunderstandings and misconceptions in "An Interactive Tutorial for Teaching Statistical Power":

"Statistical power considerations are important to adequate research design (Cohen 1988). Without sufficient statistical power, data-based conclusions may be useless. Students and researchers often misunderstand factors relating to statistical power. One common misunderstanding is the relationship between Type II errors and Type I errors. Given a design with a 5% Type I error rate, students and researchers often predict the rate of Type II errors also to be 5% (Hunter 1997; Nickerson 2000). Of course, the probability of a Type II error is generally much greater than 5%, and in a given study, the probability of a Type II error is inversely related to the Type I error rate. Another misconception is the belief that failure to reject the null hypothesis is sufficient evidence that the null hypothesis is true (that is, failing to reject suggests that the null hypothesis is true; see Nickerson 2000 [this being "the misconception" dominating in the Shaffer and Renaud studies]). The prevalence of underpowered studies in many fields is striking evidence of a lack of comprehension of the relevance of statistical power to research design (for example, on average, the Type II error rate in psychology and education is estimated to be 50% or more; Sedlmeier and Gigerenzer 1989; Lipsey and Wilson 1993)."
Post Hoc Power Analysis (Fisher exact test) - Power = 0.35 (one-sided or two-sided) - Meaning a high probability (0.65) of a Type II Error (as compared to the generally acceptable probability of 0.20). A Type II error occurs when the two groups truly differ, but the study lacks the power needed to detect that difference. [Calculator: Power (one-sided or two-sided Fisher exact test) = 0.3527. Same result with the G*Power program.]
In their paper, the researchers could then have reported that the probability of a Type II error was high (0.65): a real difference between the groups may exist, but the power to detect it was low (0.35). Furthermore, for future researchers seeking to replicate the study, they could have given the following sample size analysis based on their study results and the need for a power of 0.80.
- Fisher Test, p1 = 0.025 [3 / 120], p2 = 0.000 [0 / 147], alpha = 0.05, power = 0.80 (acceptable level):
- One-sided Fisher Exact Test: n1 = n2 = 339 = Sample Size Required [G*Power: n1 = n2 = 339]
- Two-sided Fisher Exact Test: n1 = n2 = 445 = Sample Size Required [G*Power: n1 = n2 = 445]
The above information would also have precluded the study authors from making their monumental interpretation error: their statistically nonsignificant p-value (p = 0.09, one-sided or two-sided Fisher exact test), meaning only that the Null Hypothesis was not rejected (a fact), was interpreted as meaning that there is no difference between the two groups (a likely falsehood). Such a conclusion would only be plausible - but still not absolute - if the study had a power of 0.80.
Whereas Shaffer et al. could argue that they could not have known the effect size before carrying out their study, and thus were unable to perform an a priori power analysis, this defense does not apply to Renaud et al. The a priori power analysis for the Renaud study would have been the post hoc power analysis, including the sample size recommendation (detailed above), that the Shaffer et al. researchers should have included in their paper. Given that this information was not made available, it should have been generated and reported by the Renaud et al. researchers. It would then have been important to explain, given the facts of the case, why they chose to have only 55 individuals in both the suicide group and the control group. At that point they would have had to generate another analysis, based on the Shaffer et al. results, reporting the expected power of the study they were about to carry out.
- Fisher Exact Test, p1 = 0.025, p2 = 0.000, alpha = 0.05, n1 = n2 = 55 - Power = 0.0025 (two-sided), 0.0121 (one-sided). Therefore, the proposed study had only a 1.2% chance of detecting a difference between the two groups using the one-sided Fisher exact test. [Calculator]
This information would have revealed - given that the homosexuality-related questions to be used with informants were taken from the Shaffer et al. study - that the results of the Shaffer et al. study were known, including the counts (3 / 120 vs 0 / 147) and the related statistical nonsignificance of the analysis (p = 0.09). This would have been embarrassing to acknowledge, as would have been the following questions posed to the researchers:
- Given that Shaffer et al. only had 3 homosexually oriented individuals in a sample of 120 adolescent suicides, how many could you expect from your sample of 55 adolescent suicides? One? Maybe two at best?
- Given that Shaffer et al. had zero homosexually oriented individuals in a sample of 147 adolescents used for their control sample, was not the use of only 55 adolescents in your control sample a near-assurance that there would be zero homosexually oriented adolescents in the control group?
Nonetheless, the study was carried out with the likely knowledge that homosexually oriented adolescent counts would be so low that publishing results would be precluded. As could have been expected, there were zero homosexually oriented adolescents in the control group. However, likely as totally unexpected, there were 4 homosexually oriented individuals in the adolescent suicide group, or 7.27% (4 / 55), compared to 2.5% (3 / 120) in the Shaffer et al. study (nearly a threefold increase). Furthermore, the statistical analysis produced Fisher significance values of p = 0.118 (two-sided) and p = 0.059 (one-sided), indicating that the p-value was now much closer to the value required to reject the Null Hypothesis: p < 0.05. Nonetheless, Renaud et al. replicated the Shaffer et al. conclusion and therefore made the same error. That is, statistical nonsignificance - meaning only that the Null Hypothesis was not rejected - was again assumed to mean that there was no difference between the groups with respect to the presence of homosexually oriented adolescents. They also failed to give a post hoc power analysis at the end of their paper. The result of such a power analysis would have been: for a two-sided Fisher exact test, their study only had a power of 0.21 for detecting a difference between the two groups, with the power still being very low - 0.37 - for a one-sided Fisher exact test.
Post Hoc Power Analysis (Fisher exact test) - Power = 0.37 (one-sided) and Power = 0.21 (two-sided) [Same result with the G*Power program] - Meaning a high probability (0.63) of a Type II Error (as compared to the generally acceptable probability of 0.20). [Calculator: Power, Fisher exact test, one-sided (0.3708) or two-sided (0.2085)]
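The post hoc power figures quoted above for both studies (about 0.35 for the 3/120 vs 0/147 counts, and about 0.37 for the 4/55 vs 0/55 counts) can be reproduced by exact enumeration of all possible 2x2 tables. A minimal Python sketch, standard library only; the function names are ours, and the brute-force double loop is practical only for small samples such as these:

```python
from math import comb

def fisher_one_sided_p(a, n1, b, n2):
    """One-sided Fisher exact p-value for a/n1 vs b/n2
    (alternative: the rate in group 1 is higher)."""
    k = a + b  # total cases; the fixed margin of the 2x2 table
    return sum(comb(n1, x) * comb(n2, k - x)
               for x in range(a, min(k, n1) + 1)) / comb(n1 + n2, k)

def binom_pmf(x, n, p):
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

def exact_power(n1, p1, n2, p2, alpha=0.05):
    """Probability that the one-sided Fisher test rejects at level alpha
    when the true rates are p1 and p2 (exact, by enumerating all tables)."""
    power = 0.0
    for a in range(n1 + 1):
        pa = binom_pmf(a, n1, p1)
        for b in range(n2 + 1):
            pb = binom_pmf(b, n2, p2)
            if pa * pb > 0 and fisher_one_sided_p(a, n1, b, n2) <= alpha:
                power += pa * pb
    return power

print(round(exact_power(120, 3 / 120, 147, 0.0), 2))  # Shaffer et al. counts: 0.35
print(round(exact_power(55, 4 / 55, 55, 0.0), 2))     # Renaud et al. counts: 0.37
```

With a power near 0.35-0.37, a true difference of the observed size would be missed roughly two times out of three, which is exactly the Type II error problem described above.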
For the benefit of future researchers seeking to replicate the study, they could also have given the following sample size analysis, based on their study results and the need for a power of 0.80.
- Fisher Test, p1 = 0.0727 [4 / 55], p2 = 0.000 [0 / 55], alpha = 0.05, power = 0.80 (acceptable level):
- One-sided Fisher Exact Test: n1 = n2 = 111 = Sample Size Required [G*Power: n1 = n2 = 110]
- Two-sided Fisher Exact Test: n1 = n2 = 147 = Sample Size Required [G*Power: n1 = n2 = 144]
The above would also have precluded the study authors from making their monumental interpretation error: their statistically nonsignificant p-value (p = 0.059, one-sided Fisher exact test), meaning only that the Null Hypothesis was not rejected (a fact), was interpreted as meaning that there is no difference between the two groups (a likely falsehood). Such a conclusion would only be plausible - but still not absolute - if the study had a power of 0.80.
However, given the Shaffer et al. results, the Renaud et al. researchers could have recommended that future researchers use higher n's to increase the probability that such a future study would have counts giving a study power close to 0.80. These numbers could have been about 200 for adolescent suicides and about 300 for the control sample.
The percentages of homosexually oriented adolescents in the two suicide groups could be averaged ([2.5% + 7.3% = 9.8] / 2 = 4.9%), with this roughly 5% proportion, and 0% in the control group, then used for a sample size estimate at a power of 0.80.
Fisher Test (One-sided), p1 = 0.05, p2 = 0.000, alpha = 0.05, power = 0.80 (acceptable level), n2 = twice the size of n1:
Sample Size Required: "Assuming outcome data will be analyzed prospectively by Fisher's exact test or with a continuity-corrected chi-squared test and that all observations are independent": n1 = 128, n2 = 255 (Calculator). G*Power: n1 = 120, n2 = 240
Sample Size Required: With Continuity Correction: n1 = 127, n2 = 254 (Calculator)
Generated with the "PS Power" Program: "We are planning a study of independent cases and controls with 2 control(s) per case. Prior data indicate that the [homosexuality] rate among controls is 0. If the true [homosexuality] rate for experimental subjects is 0.049, we will need to study 127 experimental subjects and 254 control subjects to be able to reject the null hypothesis that the [homosexuality] rates for experimental and control subjects are equal with probability (power) 0.8. The Type I error probability associated with this test of this null hypothesis is 0.05. We will use a continuity-corrected chi-squared statistic or Fisher's exact test to evaluate this null hypothesis." n1 = 127, n2 = 254 (Calculator Download)
Possible Outcomes, assuming a study with n1 = 128, n2 = 256:
- Possible Study Results: 5 / 128 (3.9%) vs 0 / 256 (0.0%) - Power = 0.88 [Power Calculator], Fisher One-/Two-sided: 0.012 [Fisher Exact Tests], Peto OR Estimate: 20.7 (3.2 - 134.4) [Peto OR Estimate Calculations: DJR Hutchon]
- Possible Study Results: 4 / 128 (3.1%) vs 0 / 256 (0.0%) - Power = 0.76, Fisher One-/Two-sided: 0.012 - Peto OR Estimate: 20.6 (2.5 - 165.8)
- Possible Study Results: 5 / 128 (3.9%) vs 1 / 256 (0.39%) - Power = 0.73, Fisher One-/Two-sided: 0.017 - OR: 10.3 (1.2 - 237.1) [Calculator for ORs & Fisher Exact Test]
- Possible Study Results: 4 / 128 (3.1%) vs 1 / 256 (0.39%) - Power = 0.57, Fisher One-/Two-sided: 0.044 - OR: 8.2 (0.9 - 199.4)
Note: Doubling the living control sample relative to the suicide sample (as opposed to using same-size samples) would be recommended because at least 3 cases could be expected in the studied suicide group (3 out of 120, or 4 out of 55), therefore making possible maybe 4 or 5 homosexually oriented adolescents in a group of 128 adolescent suicides. The larger control sample (n = 256) might produce at least one homosexually oriented adolescent. If not, the "0" value would nonetheless still support the conclusion that homosexually oriented adolescents are at a much greater risk for suicide (an Odds Ratio > 8), as evidenced by the Odds Ratios that would result with only one homosexually oriented adolescent in the control group and 4 or 5 homosexually oriented adolescent suicides. Increasing the living control sample would also be a cost-effective way to obtain more precise estimates.
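The odds ratios in the possible-outcome list above can be reproduced directly from the 2x2 counts. A minimal sketch (the function names are ours); note that the intervals quoted above come from Peto or exact methods, so the simple Woolf (log) interval below reproduces the point estimates but not the quoted interval bounds:

```python
from math import exp, log, sqrt

def odds_ratio(a, n1, b, n2):
    """Unadjusted odds ratio for a cases out of n1 vs b cases out of n2."""
    return (a / (n1 - a)) / (b / (n2 - b))

def woolf_ci(a, n1, b, n2, z=1.96):
    """Approximate 95% CI via Woolf's log method; with sparse counts it is
    narrower than the Peto/exact intervals quoted in the text."""
    se = sqrt(1 / a + 1 / (n1 - a) + 1 / b + 1 / (n2 - b))
    log_or = log(odds_ratio(a, n1, b, n2))
    return exp(log_or - z * se), exp(log_or + z * se)

print(round(odds_ratio(5, 128, 1, 256), 1))  # 5/128 vs 1/256: OR ≈ 10.4
print(round(odds_ratio(4, 128, 1, 256), 1))  # 4/128 vs 1/256: OR ≈ 8.2
```

The small differences from the quoted figures (e.g. 10.3 vs 10.4) reflect rounding and method differences between calculators.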
See above. It must be noted, however, that using the (conditional) Fisher exact test was most inappropriate for the Shaffer and Renaud studies, given the low event counts and especially the presence of zero counts in 2x2 cells ("0" homosexually oriented adolescents in both control groups). More appropriate methods - generally unconditional tests, but also other tests - will be explored below, the conclusion being that the Null Hypothesis should have been rejected in both studies, and that homosexually oriented adolescents have been at greater risk for suicide compared to their heterosexual counterparts. This conclusion is even more strongly supported by the results of the two studies combined via meta-analysis. What remains to be determined is the magnitude of their greater risk for suicide, expressed either as a Risk Ratio or an Odds Ratio. Having sufficient event counts, especially in the control group, is therefore of great importance to make such calculations possible. At worst, given the above "results" possibilities with "0" counts in the control group, we could nonetheless conclude that the ORs would be at least 8 to 10, with one problem: wide confidence intervals.
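One of the unconditional tests referred to above is Boschloo's test, which uses the Fisher p-value as its test statistic and then maximizes the rejection probability over the unknown common rate under the null hypothesis; by construction its p-value can never exceed Fisher's. A minimal sketch (our own illustration, standard library only, with a simple grid search over the nuisance parameter), applied to the Renaud et al. counts:

```python
from math import comb

def fisher_one_sided_p(a, n1, b, n2):
    """One-sided Fisher exact p-value for a/n1 vs b/n2."""
    k = a + b
    return sum(comb(n1, x) * comb(n2, k - x)
               for x in range(a, min(k, n1) + 1)) / comb(n1 + n2, k)

def boschloo_one_sided_p(a, n1, b, n2, grid=200):
    """Unconditional (Boschloo) p-value: maximize, over a grid of common
    rates pi under H0, the probability of all tables at least as extreme
    (by Fisher p-value) as the observed one."""
    p_obs = fisher_one_sided_p(a, n1, b, n2)
    region = [(x, y) for x in range(n1 + 1) for y in range(n2 + 1)
              if fisher_one_sided_p(x, n1, y, n2) <= p_obs + 1e-12]
    best = 0.0
    for i in range(1, grid):
        pi = i / grid
        pmf1 = [comb(n1, x) * pi ** x * (1 - pi) ** (n1 - x) for x in range(n1 + 1)]
        pmf2 = [comb(n2, y) * pi ** y * (1 - pi) ** (n2 - y) for y in range(n2 + 1)]
        best = max(best, sum(pmf1[x] * pmf2[y] for x, y in region))
    return best

p_fisher = fisher_one_sided_p(4, 55, 0, 55)      # ≈ 0.059: "not significant"
p_boschloo = boschloo_one_sided_p(4, 55, 0, 55)  # never exceeds the Fisher p-value
print(round(p_fisher, 4), round(p_boschloo, 4))
```

Production analyses would use a vetted implementation (e.g. the exact unconditional tests discussed by Lydersen et al.) rather than this sketch, but it makes concrete why the unconditional p-value is less conservative than Fisher's.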
Abstract: This paper describes an interactive Web-based tutorial that supplements instruction on statistical power. This freely available tutorial provides several interactive exercises that guide students as they draw multiple samples from various populations and compare results for populations with differing parameters (for example, small standard deviation versus large standard deviation). The tutorial assignment includes diagnostic multiple-choice questions with feedback addressing misconceptions, and follow-up questions suitable for grading. The sampling exercises utilize an interactive Java applet that graphically demonstrates relationships between statistical power and effect size, null and alternative populations and sampling distributions, and Type I and II error rates. The applet allows students to manipulate the mean and standard deviation of populations, sample sizes, and Type I error rate.

Alhija FNA (2009). Effect Size Reporting Practices in Published Articles. Educational and Psychological Measurement, 69(2): 245-265. Abstract.
Abstract: Effect size (ES) reporting practices in a sample of 10 educational research journals are examined in this study. Five of these journals explicitly require reporting ES and the other 5 have no such policy. Data were obtained from 99 articles published in the years 2003 and 2004, in which 183 statistical analyses were conducted. Findings indicate no major differences between the two types of journals in terms of ES reporting practices. Different conclusions could be reached based on interpreting ES versus p values. The discrepancy between conclusions based on statistical versus practical significance is frequently not reported, not interpreted, and mostly not discussed or resolved.

Balkin RS, Sheperis CJ (2011). Evaluating and reporting statistical power in counseling research. Journal of Counseling & Development, 9: 268-272. Abstract.
Abstract: Despite recommendations from the Publication Manual of the American Psychological Association (6th ed.) to include information on statistical power when publishing quantitative results, authors seldom include analysis or discussion of statistical power. The rationale for discussing statistical power is addressed, approaches to using G*Power to report statistical power are presented, and examples for reporting statistical power are provided. [The G*Power Program: Download Page. An easy-to-use Java-based program for a post hoc analysis of power using the Fisher exact one-sided or two-sided test: Download Page. Also available: The "PS Power" program: Download Page.]

Cashen LH, Geiger SW (2004). Statistical Power and the Testing of Null Hypotheses: A Review of Contemporary Management Research and Recommendations for Future Studies. Organizational Research Methods, 7(2): 151-167. Abstract. Full Text.
Abstract: The purpose of this study is to determine how well contemporary management research fares on the issue of statistical power with regard to studies specifically predicting null relationships between phenomena of interest. This power assessment differs from traditional power studies because it focuses solely on studies that offered and tested null hypotheses. A sample of studies containing hypothesized null relationships was taken from five mainstream management journals over the 1990 to 1999 time period. Results of the power assessment suggest that management researchers' abilities to affirm null hypotheses are low. On average, the power assessment revealed that for those studies that found nonsignificance of results and consequently affirmed their null hypotheses, the actual Type II error rate was nearly 15 times greater than what is advocated in the literature when failing to reject a false null hypothesis [emphasis mine]. Recommendations for researchers proposing and testing formal null hypotheses are also discussed.

Cohen, Jacob (1992). A Power Primer. Psychological Bulletin, 112(1): 155-159. Abstract. Full Text.
Abstract: One possible reason for the continued neglect of statistical power analysis in research in the behavioral sciences is the inaccessibility of or difficulty with the standard material. A convenient, although not comprehensive, presentation of required sample sizes is provided. Effect-size indexes and conventional values for these are given for operationally defined small, medium, and large effects. The sample sizes necessary for .80 power to detect effects at these levels are tabled for 8 standard statistical tests: (1) the difference between independent means, (2) the significance of a product–moment correlation, (3) the difference between independent rs, (4) the sign test, (5) the difference between independent proportions, (6) chi-square tests for goodness of fit and contingency tables, (7) 1-way analysis of variance (ANOVA), and (8) the significance of a multiple or multiple partial correlation.

Freedman KB, Back S, Bernstein J (2001). Sample size and statistical power of randomised, controlled trials in orthopaedics. The Journal of Bone and Joint Surgery, British Volume, 83(3): 397-402. Abstract. Full Text.
Lau CC, Kuk F (2011). Enough is enough: A primer on power analysis in study designs. The Hearing Journal, 64(4): 30-39. Abstract / Download Page.
Abstract: We estimated the statistical power of the first and last statistical test presented in 697 papers from 10 behavioral journals. First tests had significantly greater statistical power and reported more significant results (smaller p values) than did last tests. This trend was consistent across journals, taxa, and the type of statistical test used. On average, statistical power was 13–16% to detect a small effect and 40–47% to detect a medium effect. This is far lower than the general recommendation of a power of 80%. By this criterion, only 2–3%, 13–21%, and 37–50% of the tests examined had the requisite power to detect a small, medium, or large effect, respectively [emphasis mine]...

Nickerson RS (2000). Null hypothesis significance testing: a review of an old and continuing controversy. Psychological Methods, 5(2): 241-301. Abstract.
Abstract: Null hypothesis significance testing (NHST) is arguably the most widely used approach to hypothesis evaluation among behavioral and social scientists. It is also very controversial. A major concern expressed by critics is that such testing is misunderstood by many of those who use it. Several other objections to its use have also been raised. In this article the author reviews and comments on the claimed misunderstandings as well as on other criticisms of the approach, and he notes arguments that have been advanced in support of NHST. Alternatives and supplements to NHST are considered, as are several related recommendations regarding the interpretation of experimental data. The concluding opinion is that NHST is easily misunderstood and misused but that when applied with good judgment it can be an effective aid to the interpretation of experimental data.

Sedlmeier P, Gigerenzer G (1989). Do studies of statistical power have an effect on the power of studies? Psychological Bulletin, 105: 309-316. Abstract. Full Text.
Abstract: The longterm impact of studies of statistical power is investigated using J. Cohen's (1962) pioneering work as an example. We argue that the impact is nil; the power of studies in the same journal that Cohen reviewed (now the Journal of Abnormal Psychology) has not increased over the past 24 years. In 1960 the median power (i.e., the probability that a significant result will be obtained if there is a true effect) was .46 for a medium size effect, whereas in 1984 it was only .37. The decline of power is a result of alphaadjusted procedures. Low power seems to go unnoticed: only 2 out of 64 experiments mentioned power, and it was never estimated. Nonsignificance was generally interpreted as confirmation of the null hypothesis (if this was the research hypothesis), although the median power was as low as .25 in these cases. We discuss reasons for the ongoing neglect of power.
"If the author does not obtain significant results in his/her study, the likelihood of being published is severely diminished due to the publication bias that exists for statistically significant results (Begg, 1994). As a result there may be literally thousands of studies with meaningful effect sizes that have been rejected for publication or never submitted for publication. These studies are lost because they do not pass muster with NHST." (Nix & Barnette, 1998: 5-6)

Nix TW, Barnette JJ (1998). The Data Analysis Dilemma: Ban or Abandon. A Review of Null Hypothesis Significance Testing. Research in the Schools, 5(2): 3-14. Full Text.

Comment
Yet both the Shaffer et al. (1995) and Renaud et al. (2010) studies were published even though their results were not statistically significant. Was there perhaps another form of "bias" that made these studies so easy to publish? Could it be that they were published because they both blatantly accepted the null hypothesis, even if doing so violated protocol? Why was this not detected by peer reviewers or the journal editors?
The Shaffer et al. (1995) & Renaud et al. (2010) Studies: Arcsine Difference Meta-Analysis *
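As a sketch of what an arcsine-difference meta-analysis of the two studies involves (this is our illustration, not the published computation, which may differ in weighting and zero-cell handling): each study's effect is Cohen's h, the difference of arcsine-transformed proportions, and the studies are combined with inverse-variance fixed-effect weights, using var(h) ≈ 1/n1 + 1/n2:

```python
from math import asin, sqrt

def cohens_h(p1, p2):
    """Cohen's effect size h: difference of arcsine-transformed proportions."""
    return 2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2))

def fixed_effect_meta(studies):
    """Fixed-effect combination of arcsine differences; for 2*asin(sqrt(p))
    the variance is about 1/n, so var(h) ≈ 1/n1 + 1/n2 per study."""
    num = den = 0.0
    for p1, n1, p2, n2 in studies:
        w = 1.0 / (1.0 / n1 + 1.0 / n2)  # inverse-variance weight
        num += w * cohens_h(p1, p2)
        den += w
    pooled = num / den
    z = pooled * sqrt(den)  # pooled effect divided by its standard error
    return pooled, z

studies = [
    (3 / 120, 120, 0.0, 147),  # Shaffer et al. (1995): 3/120 vs 0/147
    (4 / 55, 55, 0.0, 55),     # Renaud et al. (2010): 4/55 vs 0/55
]
pooled_h, z = fixed_effect_meta(studies)
print(round(pooled_h, 2), round(z, 1))  # pooled h ≈ 0.38, z ≈ 3.7
```

Under this sketch the pooled arcsine difference is clearly nonzero (z well above 1.96), consistent with the page's argument that the combined evidence rejects the null hypothesis.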
Our calculations indicate that OR = 1.68, 3.47, and 6.71 are equivalent to Cohen's d = 0.2 (small), 0.5 (medium), and 0.8 (large), respectively, when disease rate is 1% in the nonexposed group; Cohen's d < 0.2 when OR < 1.5, and Cohen's d > 0.8 when OR > 5. It would be useful to have values with corresponding qualitative descriptors that estimate the strength of such associations; however, to date there is no consensus as to what those values of OR may be. Cohen (1988) suggested that d = 0.2, 0.5, and 0.8 are small, medium, and large on the basis of his experience as a statistician, but he also warned that these were only "rules of thumb." Better guidelines are needed to draw conclusions about strength of associations in studies of risks for disease when we use OR as the index of effect size in epidemiological studies. (p. 864)

Coe, Robert (2002). It's the Effect Size, Stupid. What effect size is and why it is important. Paper presented at the Annual Conference of the British Educational Research Association, University of Exeter, England, 12-14 September 2002. Full Text.
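The OR-to-d equivalences quoted above can be reproduced with a probit-based conversion: turn the odds ratio plus the 1% baseline rate into the two event rates, transform each rate to a normal quantile, and take the difference. This is our reconstruction of a conversion that matches the quoted figures, not necessarily the original authors' exact procedure:

```python
from statistics import NormalDist

def d_from_or(odds_ratio, p2):
    """Cohen's d implied by an odds ratio and the event rate p2 in the
    non-exposed group, via the difference of probit-transformed rates."""
    odds2 = p2 / (1 - p2)
    odds1 = odds_ratio * odds2
    p1 = odds1 / (1 + odds1)
    nd = NormalDist()
    return nd.inv_cdf(p1) - nd.inv_cdf(p2)

# With a 1% rate in the non-exposed group, the quoted ORs map onto
# Cohen's small/medium/large benchmarks:
for or_value, label in [(1.68, "small"), (3.47, "medium"), (6.71, "large")]:
    print(f"OR = {or_value}: d = {d_from_or(or_value, 0.01):.2f} ({label})")
```

The same function also shows why the mapping depends on the baseline rate: the same OR implies a different d when the non-exposed rate is not 1%.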
Cohen, Jacob (1988). Statistical power analysis for the behavioral sciences. Second Edition. Hillsdale, New Jersey: Lawrence Erlbaum Associates, Inc. Google Books. Amazon. Excerpts on a Related Webpage.
"Suicidologists have had great difficulty in identifying meaningful correlates and predictors of suicidal behavior. Because of this, Neuringer and Kolstoe (1966) suggested adopting less stringent criteria for statistical significance in suicide research, perhaps allowing rejection of the null hypothesis at the 10% level instead of the 5% level. This is an intriguing idea which has never been followed up, but it would result in the appearance of a larger proportion of "significant" results that were never replicated." (Rogers & Lester, 2010: Chapter 2: General Methodological Issues, p. 21)

Neuringer C, Kolstoe RH (1966). Suicide research and the nonrejection of the null hypothesis. Perceptual & Motor Skills, 22: 115-118. Summary & First Page Excerpt.

Rogers JR, Lester D (2010). Understanding Suicide: Why We Don't and How We Might. Cambridge, MA: Hogrefe Publishing. Hogrefe Publishing. Amazon. Book Review.

Comment
A possibility not mentioned: maybe the most commonly used statistical test was flawed in some way. Maybe more powerful tests were available to more accurately determine whether or not the null hypothesis should be rejected?
By 1990, an important fact related to the conditional Fisher Exact Test was highlighted in the Hirji et al. (1991) abstract: "The use of the Fisher exact test for comparing two independent binomial proportions has spawned an extensive controversy in the statistical literature. Many critics have faulted this test for being highly conservative."

In the Shaffer et al. and Renaud et al. studies, the widespread Fisher Exact Test was used. In this section, we demonstrate that this test is unnecessarily conservative and that more appropriate tests are available, which would lead to a reversal of the conclusions made by the authors.
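The mid-p adjustment discussed below is easy to state concretely: the one-sided mid-p value counts only half the probability of the observed table, instead of all of it. A minimal sketch (ours, standard library only) applied to both studies' counts shows the reversal at the 0.05 level:

```python
from math import comb

def fisher_and_midp(a, n1, b, n2):
    """One-sided Fisher exact p-value and Lancaster's mid-p for a/n1 vs b/n2."""
    k, total = a + b, comb(n1 + n2, a + b)
    def table_prob(x):  # conditional probability of x cases in group 1
        return comb(n1, x) * comb(n2, k - x) / total
    p = sum(table_prob(x) for x in range(a, min(k, n1) + 1))
    return p, p - 0.5 * table_prob(a)  # mid-p: half weight on the observed table

for a, n1, b, n2, study in [(3, 120, 0, 147, "Shaffer et al."),
                            (4, 55, 0, 55, "Renaud et al.")]:
    p, midp = fisher_and_midp(a, n1, b, n2)
    print(f"{study}: Fisher p = {p:.4f}, mid-p = {midp:.4f}")
# Shaffer et al.: Fisher p = 0.0895, mid-p = 0.0448
# Renaud et al.: Fisher p = 0.0591, mid-p = 0.0295
```

With these counts the mid-p values fall below 0.05 while the Fisher p-values do not: exactly the reversal of conclusions argued for on this page.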
Being "highly conservative" means that where the conditional Fisher Exact Test produces, for example, p = 0.070, the Null Hypothesis is not rejected and the difference between the two compared groups is deemed not statistically significant, but these conclusions might be in error. That is, if more appropriate statistical tests had been carried out - such as "unconditional exact tests," or other tests for statistical significance that approximate unconditional test results, such as mid-p - most "p" results might have been in the "statistically significant" category. In other words, the conservatism of the Fisher Exact Test would have been removed, as noted by Agresti & Gottard (2007). With respect to the Shaffer (1995) and Renaud (2010) studies, the one-sided Fisher Exact Test was used, but Lydersen et al. (2009) emphasized the following in both the abstract and conclusion of their paper:
"The traditional Fisher's exact test should practically never be used." The reason is given: "Unconditional tests preserve the significance level and generally are more powerful than [the conditional] Fisher's exact test for moderate to small samples, but previously were disadvantaged by being computationally demanding. This disadvantage is now moot, as software to facilitate unconditional tests has been available for years. Moreover, Fisher's exact test with mid-p adjustment gives about the same results as an unconditional test. Consequently, several better tests are available, and the choice of a test should depend only on its merits for the application involved. Unconditional tests and the mid-p approach ought to be used more than they now are." ... Elsewhere in the paper: "We consider exact unconditional tests to be the gold standard for testing association in 2×2 tables." In other words, in many cases, and especially in studies with two small independent binomial samples, the conditional Fisher Exact Test is less powerful and less exact in determining statistical significance when comparing the two samples. One concern with using the mid-p value, however, has been that mid-p statistically significant outcomes might produce more Type I errors (determinations of statistical significance when non-significance exists) than the Fisher Exact Test. Concerning this issue, researchers have been reporting that the problem is either minimal or that it does not exist (e.g., Hwang & Yang, 2001; Crans & Shuster, 2008; Parzen, 2009; Biddle & Morris, 2011). It is also noted that using mid-p values reduces the high Type II error rates produced by the Fisher Exact Test, which appears to have been the problem when the Fisher Exact Test was applied to the Shaffer and Renaud study data. Concerning this issue, Biddle and Morris (2011) state:
"In recent years, there has been growing support in the statistical literature regarding an adjustment to the FET [Fisher exact test], namely Lancaster's mid-P (LMP) test (Lancaster, 1961), which yields a better balance of Type I and Type II errors. The LMP test produces Type I error rates that are generally closer to the nominal significance level (typically set at .05) than the FET, which often shows Type I error rates substantially below the nominal level (Crans & Shuster, 2008). Consequently, the LMP test often has greater statistical power (lower Type II error rates) than the FET. The LMP test has been widely accepted in the medical and statistics fields..." (p. 956) Fellows (2010) presents the best support to date for using mid-p values:
"Although the mid-p was developed over 50 years ago, practitioners have yet to adopt it into everyday practice. Because the mid-p can have Type I error greater than its nominal level, it could possibly be viewed as deceptive (Routledge, 1994). This view is understandable considering the lack of solid theoretical grounding. The mid-p has, with some exceptions (Lancaster, 1961; Hwang and Yang, 2001), been justified using heuristic devices (Barnard, 1989; Berry and Armitage, 1995) and numerical simulation (Hirji, 1991). While these justifications do provide some comfort that the mid-p is safe, one would be correct to be somewhat uneasy with its wholesale application to all problems. Under the estimated truth framework, the mid-p has a solid claim to primacy. The minimax theorems developed in this article show that in the worst-case scenarios, the mid-p is the least risky p-value. These results apply to a surprisingly wide range of situations. In the case of a simple hypothesis test, the mid-p and the likelihood ratio ordering were found to be mutually supportive of one another. In the case of one-sided tests, the mid p-value was shown to dominate all other members of its family. The mid p-value has smaller maximum risk than any of the more conservative members of its family in the case of two-sided tests, and dominates all of its family in some important special cases that pervade statistical practice." (p. 251)
In the "Statistical Significance Tests" table, results from various unconditional tests, mid-p analyses, and other tests (e.g., the arcsine difference) are given for both the Shaffer and Renaud studies. In the great majority of cases (and more so for one-sided tests), the results generated from unconditional tests, mid-p analyses, and other tests indicate statistical significance for the difference between the two independent binomial samples in both studies (meaning that the null hypothesis is to be rejected). On this basis, it is therefore concluded that homosexually oriented adolescents are likely at greater risk for suicide compared to their heterosexual counterparts. Parzen (2009) reports that mid-p values also closely approximate the result of a Bayesian analysis, which is what was observed in the Bayesian analysis and meta-analysis of the two studies, as summarized below.
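The one-sided Fisher exact p-values and their Lancaster mid-p counterparts can be reproduced directly from the hypergeometric null distribution. The sketch below is a minimal illustration (not the authors' own computation) using the male counts given in the table that follows: Shaffer, 3 of 95 suicide victims vs. 0 of 116 controls classified as homosexual; Renaud, 4 of 55 vs. 0 of 55. The resulting Fisher p-values are close to, though not necessarily identical with, the reported 0.088 and 0.061, depending on which subgroup counts are used; in both studies the mid-p value falls below .05 while the Fisher p-value does not.

```python
# Sketch: one-sided Fisher exact test and Lancaster mid-p for a 2x2 table,
# computed from the hypergeometric null distribution. Counts are taken from
# the Shaffer et al. (1995) and Renaud et al. (2010) tables on this page.
from scipy.stats import hypergeom

def fisher_and_midp(x1, n1, x2, n2):
    """One-sided (greater) Fisher exact p and mid-p for x1/n1 vs. x2/n2."""
    total = n1 + n2        # all subjects
    cases = x1 + x2        # all subjects classified as homosexual
    rv = hypergeom(total, cases, n1)
    p_fisher = rv.sf(x1 - 1)              # P(X >= x1)
    p_mid = rv.sf(x1) + 0.5 * rv.pmf(x1)  # P(X > x1) + 0.5 * P(X = x1)
    return p_fisher, p_mid

shaffer = fisher_and_midp(3, 95, 0, 116)  # 3/95 suicides vs. 0/116 controls
renaud = fisher_and_midp(4, 55, 0, 55)    # 4/55 suicides vs. 0/55 controls
print(shaffer, renaud)
```

In both studies the conservative Fisher p-value sits just above the .05 threshold while the mid-p value sits below it, which is the reversal of conclusions discussed in the text.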
The Shaffer et al. (1995) & Renaud et al. (2010) Studies: Suicide Risk & Adolescent Homosexuality 

Study 
Homosexuality As Suicide Risk Factor 
Study Information 
Shaffer et al. (1995) 
3 Suicides, All Male = 3.2% of Males = 2.5% of Sample. Result Reported to be Not Significant: Fisher Exact Test, One-Sided: p = 0.088 
Psychological Autopsy: 170 consecutive New York City suicide victims (1984-1986) under the age of 20; 120 were available for study, 95 males and 25 females. Not one female was deemed to be gay/bisexual, or to have had sex with a female. 3 males were deemed to be homosexual: one having acknowledged his gay orientation, and 2 having been homosexually active. 1 male, not classified as homosexual, committed suicide with one of the homosexual male victims; both were found dead holding hands. He was classified only as a friend of gay teenagers. 5 other males were "known to be close friends of other gay teenagers." 3 males "reported to have been effeminate in their behavior" were not classified as homosexual, but the only openly gay suicide victim was effeminate. There were 83 males with no information about their sexual orientation, and with nothing that could lead researchers to suspect they were gay/bisexual or had homosexual experiences; they were therefore considered heterosexual unless proven otherwise. There may have been undisclosed homosexual individuals among these 83 males, thus biasing the actual group difference downwards. 147 male and female controls were studied, obtained from a random sampling of 196, with 49 having refused to participate in the study; there were 116 males in the control group. The method used to discover whether a suicide victim was gay or bisexual and/or had ever had same-gender sex consisted of asking: 1. "a parent or other adult member of the household in which the victim was living at the time of death;" 2. "either a sibling or a friend from the victim's peer group nominated by the parent or caretaker;" 3. "at least one school teacher (and, more usually three) nominated by the school principal as being well informed about the subject's classroom behavior". Caveat: Low counts. Plus: with "0" individuals classified as homosexual in the control group, it is impossible to calculate an odds ratio; the control group should therefore have been larger to make this possible. 
Renaud et al. (2010) 
4 Homosexual Suicides out of 55 Suicide Victims. Result Reported to be Not Significant (as above): Fisher Exact Test, One-Sided: p = 0.061 
Canadian (Quebec) youth suicide victims (n = 55, 2000 to 2003, 11 to 18 years old, 43 males, 12 females); 4 (3 males, 1 female) were deemed to be homosexual. Control subjects (n = 55): none deemed to be homosexual. "Within the suicide victims, 4 people were found to have had a same-sex sexual experience, described themselves as having same-sex sexual orientation, or expressed concern regarding their sexual orientation (3 males and 1 female)." Caveat: Low counts. Plus: with "0" individuals classified as homosexual in the control group, it is impossible to calculate an odds ratio; the control group should therefore have been larger to make this possible. 
Combining the Shaffer et al. (1995) & Renaud et al. (2010) Studies 


Statistical Calculations for the Fisher Exact Test, One-Sided, Carried Out At: http://statpages.org/ctab2x2.html. For Other Statistical Tests of Significance for Both Studies, See Table Below. 
From: Plöderl et al. (2013). Note: The combined Shaffer & Renaud study results are used in a book chapter by Wetzels et al. (in press) to illustrate how a Bayesian analysis proceeds.
Wetzels, R., van Ravenzwaaij, D., & Wagenmakers, E.-J. (in press). Bayesian analysis. In R. Cautin & S. Lilienfeld (Eds.), The Encyclopedia of Clinical Psychology. Wiley-Blackwell. Reference. PDF Download. 
Statistical Significance / p-Value Tests: Shaffer et al. (1995) & Renaud et al. (2010) Studies 

Significance Tests Calculated by Plöderl & Tremblay 
Abstract: The asymptotic Pearson's chi-squared test and Fisher's exact test have long been the most used for testing association in 2×2 tables. Unconditional tests preserve the significance level and generally are more powerful than Fisher's exact test for moderate to small samples, but previously were disadvantaged by being computationally demanding. This disadvantage is now moot, as software to facilitate unconditional tests has been available for years. Moreover, Fisher's exact test with mid-p adjustment gives about the same results as an unconditional test. Consequently, several better tests are available, and the choice of a test should depend only on its merits for the application involved. Unconditional tests and the mid-p approach ought to be used more than they now are. The traditional Fisher's exact test should practically never be used (Emphasis added).
Recommendations: Exact tests have the important property of always preserving test size. Our general recommendation is not to condition on any marginals not fixed by design. In practice, this means that an exact unconditional test is ideal. Pearson's chi-squared (z-pooled) statistic or Fisher-Boschloo's statistic works well with an exact unconditional test. Further, such a test can be approximated by an exact conditional mid-p test or, in large samples, by the traditional asymptotic Pearson's chi-squared test. However, when an exact test is chosen, an unconditional test is clearly recommended. The traditional Fisher's exact test should practically never be used (p. 1174, Emphasis added).
Abstract: Fisher's exact test for comparing response proportions in a randomized experiment can be overly conservative [Many Type II Errors] when the group sizes are small or when the response proportions are close to zero or one. This is primarily because the null distribution of the test statistic becomes too discrete, a partial consequence of the inference being conditional on the total number of responders. Accordingly, exact unconditional procedures have gained in popularity, on the premise that power will increase because the null distribution of the test statistic will presumably be less discrete. However, we caution researchers that a poor choice of test statistic for exact unconditional inference can actually result in a substantially less powerful analysis than Fisher's conditional test. To illustrate, we study a real example and provide exact test size and power results for several competing tests, for both balanced and unbalanced designs. Our results reveal that Fisher's test generally outperforms exact unconditional tests based on using as the test statistic either the observed difference in proportions, or the observed difference divided by its estimated standard error under the alternative hypothesis, the latter for unbalanced designs only. On the other hand, the exact unconditional test based on the observed difference divided by its estimated standard error under the null hypothesis (score statistic) outperforms Fisher's test, and is recommended. Boschloo's test, in which the p-value from Fisher's test is used as the test statistic in an exact unconditional test, is uniformly more powerful than Fisher's test, and is also recommended.
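Boschloo's procedure recommended in this abstract is available in modern software. As an illustrative sketch (not a calculation from the paper itself), SciPy's implementation can be applied to the Shaffer male counts used elsewhere on this page (3 of 95 suicides vs. 0 of 116 controls classified as homosexual):

```python
# Sketch: Boschloo's exact unconditional test vs. Fisher's exact test on the
# Shaffer et al. (1995) male counts. Requires scipy >= 1.7.
from scipy.stats import fisher_exact, boschloo_exact

table = [[3, 92], [0, 116]]  # rows: suicides, controls; cols: homosexual, not

_, p_fisher = fisher_exact(table, alternative="greater")
res = boschloo_exact(table, alternative="greater")

# Boschloo's test uses Fisher's p-value as its test statistic, so its p-value
# can never exceed Fisher's: the test is uniformly more powerful.
print(p_fisher, res.pvalue)
```

The design choice matches the abstract: rather than inventing a new statistic, Boschloo's test re-uses Fisher's p-value inside an unconditional framework, which is why its p-value is guaranteed to be no larger than Fisher's.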
"To circumvent the conservatism of exact conditional inference based on Fisher's test, Barnard (1945, 1947) proposed the use of exact unconditional inference, based on elimination of the nuisance parameter by maximization. However, invoking Fisher's principle of ancillarity (see Basu, 1977; Little, 1989), Barnard (1949) subsequently renounced his unconditional test. Despite his renouncement and other publications in support of Fisher's test (Yates, 1984; Barnard, 1989, Upton, 1992), exact unconditional inference has gained in popularity over the last few decades (Berkson, 1978; Kempthorne, 1979; Santner and Snell, 1980; Upton, 1982; Suissa and Shuster, 1985; Haber, 1986; D'Agostino, Chase, and Belanger, 1988; Rice, 1988; Haviland, 1990; Storer and Kim, 1990; Andres and Mato, 1994)." (p. 441)
Basu D (1977). On the elimination of nuisance parameters. Journal of the American Statistical Association, 72: 355-366. Summary & Page 355.
Berkson J (1978). In dispraise of the exact test: Do the marginal totals of the 2×2 table contain relevant information respecting the table proportions? Journal of Statistical Planning and Inference, 2(1): 27-42. Abstract.
D'Agostino RB, Chase W, Belanger A (1988). The appropriateness of some common procedures for testing the equality of two independent binomial populations. American Statistician, 42: 198-202. Summary & Page 198.
Haviland MG (1990). Yates's correction for continuity and the analysis of 2 × 2 contingency tables (with comments). Statistics in Medicine, 9: 363-383. Abstract.
Kempthorne O (1979). In dispraise of the exact test: reactions. Journal of Statistical Planning and Inference, 3: 199-213.
Little RJA (1989). Testing the equality of two independent binomial proportions. American Statistician, 43: 283-288. Summary & Page 283.
Rice WR (1988). A new probability model for determining exact p-values for contingency tables when comparing binomial proportions. Biometrics, 44: 1-22. Abstract. PDF Download. Comment & Author Reply. Comment & Author Reply.
Santner TJ, Snell MK (1980). Small-Sample Confidence Intervals for p1 - p2 and p1/p2 in 2 × 2 Contingency Tables. Journal of the American Statistical Association, 75, No. 370: 386-394. Summary & Page 386.
Storer BE, Kim C (1990). Exact properties of some exact test statistics for comparing two binomial proportions. Journal of the American Statistical Association, 85: 146-155. Summary & Page 146.
Upton G (1982). A Comparison of Alternative Tests for the 2 x 2 Comparative Trial. Journal of the Royal Statistical Society, Series A, 145: 86-105. Summary & Page 86.
Yates F (1984). Tests of Significance for 2 × 2 Contingency Tables. Journal of the Royal Statistical Society, Series A, 147(3): 426-463. Summary & Page 426.
The authors highlight, also with the use of graphics, that the Fisher test is "conditional," meaning that "the sample space for Fisher's exact test [is] much more discrete than it is for [the unconditional] Barnard's exact test. Consequently, the number of distinct p-values that one could obtain with Fisher's exact test is less than the corresponding number of distinct p-values that one could obtain with Barnard's exact test [shown graphically]. This in turn implies that if we want to restrict the type-1 error to some upper limit, say 5%, Fisher's procedure will usually be more conservative than Barnard's, resulting in a loss of power [For the example given, one-tailed Fisher exact p-value = 0.0641, Barnard's test p-value = 0.0341]. The power loss diminishes as the sample sizes get larger since the discreteness of the Fisher statistic is not as pronounced."
The example given: 15 subjects receive a vaccine; 7 become infected, 8 remain uninfected. Versus: 15 subjects receive a placebo 'vaccine'; 12 become infected, 3 remain uninfected.
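This vaccine example can be checked in software. The sketch below (an illustration, not the source's own calculation) uses SciPy's `fisher_exact` and `barnard_exact`; note that implementations of "Barnard's test" differ (see the StatXact caveat later on this page), so the unconditional p-value may not match 0.0341 exactly.

```python
# Sketch: conditional (Fisher) vs. unconditional (Barnard-type) exact tests
# on the vaccine example: 7/15 infected with vaccine vs. 12/15 with placebo.
# Requires scipy >= 1.7.
from scipy.stats import fisher_exact, barnard_exact

table = [[7, 8], [12, 3]]  # rows: vaccine, placebo; cols: infected, not

_, p_fisher = fisher_exact(table, alternative="less")  # H1: vaccine infects less
res = barnard_exact(table, alternative="less")         # Wald-statistic variant

print(round(p_fisher, 4))  # about 0.0641, as quoted in the text
print(res.pvalue)          # smaller than Fisher's p: the power gain at work
```

With the conventional .05 threshold, the conditional test fails to reject while the unconditional test rejects, which is precisely the pattern this page argues applies to the Shaffer and Renaud data.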
Related Calculations:
At SISA: the one-sided Fisher exact test: p = 0.064068 = 0.0641. Mid-p for the one-sided Fisher exact test: p = 0.0372689 = 0.0373.
At "Exact Unconditional Homogeneity/Independence Tests for 2×2 Tables": One-sided Fisher's exact conditional p-value = 0.0641. One-sided Fisher-Boschloo, Confidence Interval Method (p = 0.0351), No Confidence Interval Method (p = 0.0341). Calculations for other p-value estimates are possible.
Using the downloaded SMP program, Version 2.1 (http://www.ugr.es/~bioest/software.htm): Mid-p for the one-sided Fisher exact test: p = 0.03411. Unconditional Arc Sine Statistic: p = 0.03411. Barnard's Test, real p-value = 0.03669. Barnard's Test, estimated p-value = 0.03669.
Martín Andrés A (1991). A review of classic non-asymptotic methods for comparing two proportions by means of independent samples. Communications in Statistics - Simulation and Computation, 20(2/3): 551-583. Abstract.
Martín Andrés A (1997). Entry 'Fisher's exact and Barnard's tests'. Encyclopedia of Statistical Sciences, Update Volume 2, 2508. Ed.: Kotz, Johnson and Read. Wiley-Interscience.
Martín Andrés A, Herranz Tejedor I (1995). Is Fisher's exact test very conservative? Computational Statistics and Data Analysis, 19: 579-591. Abstract. PDF Download.
Martín Andrés A, Silva Mato A (1994). Choosing the optimal unconditioned test for comparing two independent proportions. Computational Statistics and Data Analysis, 17(5): 555-574. Abstract.
Martín Andrés A, Sánchez Quevedo MJ, Silva Mato A (1998). Fisher's mid-p-value arrangement in 2x2 comparative trials. Computational Statistics and Data Analysis, 29(1): 107-115. Abstract.
Silva Mato A, Martín Andrés A (1995). Optimal unconditional tables for comparing two independent proportions. Biometrical Journal, 37(7): 821-836. Abstract.
Silva Mato A, Martín Andrés A (1997). Simplifying the calculation of the P-value for Barnard's test and its derivatives. Statistics and Computing, 7(2): 137-143. Abstract.
Silva Mato A, Martín Andrés A (1997). SMP.EXE in http://www.jiscmail.ac.uk/files/EXACTSTATS.
'This file, as the Fisher's exact test, performs the exact probability test for a table of frequency data crossclassified according to two categorical variables, each of which has two levels or subcategories (2x2). It is a nonparametric statistical test used to determine if there are nonrandom associations between the two categorical variables. Barnard's exact test is used to calculate an exact P-value with small number of expected frequencies, for which the Chi-square test is not appropriate (in case the total number of observations is less than 20 or the number of frequency cells are less than 5). The test was proposed by G. A. Barnard in two papers (1945 and 1947). While Barnard's test seems like a natural test to consider, it's not at all commonly used, probably because it is little known. Perhaps due to its computational difficulty it was not widely used until recently, when computers made it feasible. It is considered that Barnard's exact test is more powerful than Fisher's..." [Note: This is likely not the true Barnard Test. Even StatXact had claimed to offer the Barnard Test, which is described as follows by Lydersen et al. (2009): "Barnard's unconditional test [20] uses a more computationally intensive algorithm for building a rejection region, and is, to our knowledge not included in any available software. StatXact provides the Suissa and Shuster test (somewhat misleadingly named Barnard's test in StatXact)." (p. 1166). Cytel (2008: http://www.cytel.com/software/StatXact.aspx) reports that "StatXact®9 [is] The Most Popular Exact Statistics Analysis Software... Only StatXact® has: ... exact power and sample size for comparing two binomials by Barnard's unconditional exact test (more powerful than Fisher's test)." See related information by Martín Andrés (2012) at Silva Mato & Martín Andrés (2000).]
"Andres et al. [34] compared 15 test statistics, including the original test by Barnard. Barnard's test and a simplified version of Barnard's test have the highest power, but are considered too computer intensive for practical use. Among the others, Pearson's chi-squared and Fisher-Boschloo have power nearly as high as the optimal Barnard's test [34]."
"There are two fundamentally different exact tests for comparing the equality of two binomial probabilities – Fisher's exact test (Fisher, 1925), and Barnard's exact test (Barnard, 1945). Fisher's exact test (Fisher, 1925) is the more popular of the two. In fact, Fisher was bitterly critical of Barnard's proposal for esoteric reasons that we will not go into here. For 2 × 2 tables, Barnard's test is more powerful than Fisher's, as Barnard noted in his 1945 paper, much to Fisher's chagrin..." [Note: This is likely not the true Barnard Test; see the preceding note for the Lydersen et al. (2009) and Cytel (2008) caveats regarding the 'Barnard's test' offered in StatXact.]
Abstract: Five standard tests are compared: chi-squared, Fisher's exact, Yates' correction, Fisher's exact mid-p, and Barnard's. Yates' is always inferior to Fisher's exact. Fisher's exact is so conservative that one should look for alternatives. For certain sample sizes, Fisher's mid-p or Barnard's test maintain the nominal alpha and have superior power. [Note: To generate the Barnard Test result, it is likely that the referenced Cytel StatXact 7 (2005) program was used. From Lydersen et al. (2009): "Barnard's unconditional test [20] uses a more computationally intensive algorithm for building a rejection region, and is, to our knowledge not included in any available software. StatXact provides the Suissa and Shuster test (somewhat misleadingly named Barnard's test in StatXact)." (p. 1166).]
Abstract: The unconditional Barnard's test for the comparison of two independent proportions is difficult to apply even with moderately large samples. The alternative is to use a χ^{2} type, arcsine, or mid-p asymptotic test. In the paper, the authors evaluate some 60 of these tests, some new and others already familiar. For the ordinary significances, the optimal tests are the arcsine methods (with the improvement proposed by Anscombe), the χ^{2} ones given by Pearson (with a correction for continuity of 2 or of 1 depending on whether the sample sizes are equal or different), and the mid-p-value ones given by Fisher (using the criterion proposed by Armitage, when applied as a two-tailed test). For one- (two-) tailed tests, the first method generally produces reliable results for E > 10.5 (E > 9 and unbalanced samples), the second method does so for E > 9 (E > 6), and the third does so in all cases, although for E <= 6 (E <= 10.5) it usually gives too many conservative results. E refers to the minimum expected quantity.
"The use of Fisher's mid-p-value as an approximation to the Barnard test is quite surprising and should be justified. Given that the Fisher exact test is very conservative (compared to Barnard's test), Plackett, in his discussion of Yates (1984), proposed Fisher's mid-p-value as a means of reducing its conservatism. The idea was favourably received by Barnard (1989), Routledge (1992), Upton (1992) and Agresti (2001) because it was a way of terminating the conditional vs. unconditional argument (Haber, 1992). Haber (1986) was the first to propose mid-p as an approximation to the unconditional test, one that was described by Hirji et al. (1991) as a quasi-exact test. Both the authors and Davis (1993) agree that mid-p is generally conservative, but quite less so than Fisher's exact test, and behaves in a very similar fashion to the χ^{2} test without c.c. Note that although one needs to use a computer to apply the mid-p, actually obtaining it presents no problem (no matter what the value of n_{i} may be)." (p. 341)
Abstract: For two independent binomial proportions Barnard (1947) has introduced a method to construct a non-asymptotic unconditional test by maximisation of the probabilities over the 'classical' null hypothesis H_{0} = {(θ_{1}, θ_{2}) ∈ [0, 1]^{2}: θ_{1} = θ_{2}}. It is shown that this method is also useful when studying test problems for different null hypotheses such as, for example, shifted null hypotheses of the form H_{0} = {(θ_{1}, θ_{2}) ∈ [0, 1]^{2}: θ_{2} ≤ θ_{1} ± Δ} for non-inferiority and one-sided superiority problems (including the classical null hypothesis with a one-sided alternative hypothesis). We will derive some results for the more general 'shifted' null hypotheses of the form H_{0} = {(θ_{1}, θ_{2}) ∈ [0, 1]^{2}: θ_{2} ≤ g(θ_{1})} where g is a non-decreasing curvilinear function of θ_{1}. Two examples for such null hypotheses in the regulatory setting are given. It is shown that the usual asymptotic approximations by the normal distribution may be quite unreliable. Non-asymptotic unconditional tests (and the corresponding p-values) may, therefore, be an alternative, particularly because the effort to compute non-asymptotic unconditional p-values for such more complex situations does not increase as compared to the classical situation. For 'classical' null hypotheses it is known that the number of possible p-values derived by the unconditional method is very large, albeit finite, and the same is true for the null hypotheses studied in this paper. In most of the situations investigated it becomes obvious that Barnard's CSM test (1947), when adapted to the respective null space, is again a very powerful test. A theorem is provided which, in addition to allowing fast algorithms to compute unconditional non-asymptotic p-values, fills a methodological gap in the calculation of exact unconditional p-values as implemented, for example, in StatXact 3 for Windows (1995).
Abstract: Unconditional non-asymptotic methods for comparing two independent binomial proportions have the drawback that they take a rather long time to compute. This problem is especially acute in the most powerful version of the method (Barnard, 1947). Thus, despite being the version which originated the method, it has hardly ever been used. This paper presents various properties which allow the computation time to be drastically reduced, thus enabling one to use not only the more traditional and simple versions given by McDonald et al. (1977) and Garside and Mack (1967), but also the more complex original version of Barnard (1947).
"The objective is to test H_{0} : p_{1} = p_{2} (= p), for which there are two competing methodologies: the conditional one (Fisher, 1935) and the unconditional one (Barnard, 1947). The argument about which methodology is more appropriate is nearly as old as modern statistics, and we will not deal with it here; the interested reader is referred to the reviews by Martín Andrés (1991), Richardson (1994) and Sahai and Khurshid (1995). For our purposes, it suffices to point out that the unconditional method is usually defended (see, for example, McDonald et al., 1977; Liddell, 1976; Haber, 1987) on the grounds that it is more powerful than the conditional method. However, it is computationally much more complex to implement." (p. 137)
Fisher RA (1935). The logic of inductive inference. Journal of the Royal Statistical Society A, 98: 39-54. Abstract. PDF Download. Letters between RA Fisher & GA Barnard (1945-1962): PDF Download.
Fisher R (1955). Statistical Methods and Scientific Induction. Journal of the Royal Statistical Society, Series B, 17(1): 69-78. Abstract. PDF Download.
Garside GR, Mack C (1967). Correct confidence limits for the 2 x 2 homogeneity contingency table with small frequencies. New Journal of Statistics and Operations Research, 3(2): 1-25.
Haber M (1987). A comparison of some conditional and unconditional exact tests for 2 x 2 contingency tables. Communications in Statistics - Simulation and Computing, 16(4): 999-1013. Abstract.
Liddell D (1976). Practical test of 2 x 2 tables. Statistician, 25(4): 295-304. Abstract.
McDonald LL, Davis BM, Milliken GA (1977). A nonrandomized unconditional test for comparing two proportions in a 2 x 2 contingency table. Technometrics, 19: 145-150. Abstract.
Richardson JTE (1994). The analysis of 2 x 1 and 2 x 2 contingency tables: an historical review. Statistical Methods in Medical Research, 3: 107-133. Abstract.
Sahai H, Khurshid A (1995). On analysis of epidemiological data involving a 2 x 2 contingency table: an overview of Fisher's exact test and Yates' correction for continuity. Journal of Biopharmaceutical Statistics, 5(1): 43-70. Abstract.
Abstract: The most powerful non-asymptotic unconditioned method for comparing two proportions (independent samples) is that of Barnard (1945), but the complexity of computation has led to several simplifying versions being produced. There is no complete global comparison of these (though there are some partial ones, such as Haber's, 1987), nor has there been an evaluation of the loss incurred by not using Barnard's method. In this paper all existing relevant versions (including Barnard's, and one proposed by the authors) for a wide range of sample sizes are compared, as well as for one- and two-tailed tests (only the second case has been dealt with in recent literature), and a conclusion is drawn about the suitability of the new method proposed. The comparison is effected on the basis of the new criterion of "mean power", and the other customary criteria for comparing methods, based on the comparison of their powers in each point of the parametric space, are criticized. The new criterion can be applied to all tests based on discrete random variables. Finally, given the large number of methods proposed in the relevant literature for solving this problem, the authors classify the same in function of their precision and their complexity of computation.
Abstract: The 2×2 table has received an enormous amount of attention in the research literature. Most studies have focused on Type I error rates and the power of the chisquare statistic, but some have been more concerned with the theoretical justification behind methods of analysis. Little consensus has been achieved in either area. The reason for this is that 2 basic inferential paradigms that underlie much of the work in 2×2 tables are incompatible. Thus, empirical studies of Type I error rates of the chisquare test within the Neyman–Pearson framework are considered irrelevant by advocates of R. A. Fisher's exact test. Both approaches are described in this article. G. A. Barnard's (1947) test is shown to be theoretically superior to the chisquare test and all of its corrected cousins. However, Fisher's exact test is advocated as the most rational choice.
Note by Patrick Onghena (1996) in ExactStats Bibliography, Part D: Comprehensive Listing: "This article by G. A. Barnard (together with the 1989 paper in Statistics in Medicine) offers strong advocacy for the 'mid-P' procedure in exact significance testing."
Excerpt: However, when X is discrete, E{T(X)} > 0.5, implying that the Fisher tail areas 'are "biased" in an upward direction' (Barnard, 1989). To correct this problem Lancaster (1949) suggested the use of the mid-P-value M(x), given by M(x) = P(X > x) + 0.5 P(X = x). (1) Further support for the use of mid-P is to be found in Stone (1969) and Anscombe (1981). (p. 399)
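As a concrete illustration of formula (1) (a sketch, not from the excerpted paper), consider an exact one-sided binomial test of H0: p = 0.5 with 8 successes in 10 trials. The ordinary exact tail area is non-significant at the .05 level, while the mid-P value is significant:

```python
# Sketch: Lancaster's mid-P for a one-sided exact binomial test,
# M(x) = P(X > x) + 0.5 * P(X = x), with X ~ Binomial(10, 0.5) and x = 8.
from scipy.stats import binom

n, x, p0 = 10, 8, 0.5
rv = binom(n, p0)

p_exact = rv.sf(x - 1)              # P(X >= 8) = 56/1024, about 0.0547
p_mid = rv.sf(x) + 0.5 * rv.pmf(x)  # P(X > 8) + 0.5*P(X = 8), about 0.0327

print(p_exact, p_mid)
```

The mid-P value counts only half of the probability of the observed outcome itself, which is exactly the correction for the upward "bias" of the Fisher tail areas described in the excerpt.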
"A problem often encountered in the practice of statistics is not that we don't have an answer to your question, but that we have too many answers and don't know which ones to choose as our "final" answer. A problem with an extensive literature, and many competing answers, is inference for parameters of discrete data, such as the true population proportion p when one observes K successes in n trials. What may be novel is our claim that for a proportion p of 0 − 1 data the mid-P frequentist confidence interval is approximately identical with the Bayesian Jeffrey's prior credible interval. An important inference method that is unknown (and perhaps difficult to accept) to many statisticians is the "mid-P" approach (usually credited to Lancaster, 1961). This paper presents theory to justify this frequentist approach and argues that it can be recommended as the "final" (benchmark) answer because it is identical with the Bayesian answer for a Jeffrey's Beta(.5,.5) prior for p. While for large samples other popular answers are approximately numerically equivalent, introductory courses will be happier if we teach only one way, the "right" way, the way that is accurate for small samples and zero successes. It is easy to compute from software for the quantile function of the Beta distribution."
Abstract: In this article, the problem of comparing two independent binomial populations is considered. It is shown that the test based on the confidence interval p value of Berger and Boos (1994) often is uniformly more powerful than the standard unconditional test. This test also requires less computational time.  Introduction: The problem of comparing two binomial proportions has been considered for many years. The most commonly used test is Fisher's Exact Test (Fisher, 1935), a conditional test. Barnard (1945, 1947) proposed an unconditional test for this problem. Although unconditional tests are usually more powerful than conditional tests, they are computationally much more complex. But recent advances in computing have made unconditional tests practical, and they are beginning to appear in statistical software packages such as StatXact 3 for Windows. In this article it is shown that unconditional tests based on the confidence interval p value of Berger and Boos (1994) are often uniformly more powerful than the standard unconditional tests.
"In Chapter 14.3.10 of the User Manual for StatXact Version 5.0.3, the Berger and Boos Correction is used for computing an unconditional exact confidence interval for the difference of two binomial parameters based on inverting two 1sided hypothesis tests. StatXact is noting that Berger and Boos (1994) actually proposed their method only for hypothesis tests and that the application to the calculation of confidence intervals is new. But having shown that the pvalues derived with the BergerBoos correction need not satisfy Barnard’s convexity condition there is no guarantee that searching only on the boundary in nonclassical (i.e. shifted) situations will successfully determine the suprema that are necessary for a correct calculation." (p. 45)
Summary: In this paper it is shown that Fisher's non-randomizing exact test for 2×2 tables, which is a conditional test, can by simple means be changed into an unconditional test using raised levels of significance; frequently, especially for samples that are not too large, the level of significance can be doubled. This leads in many cases to a considerable increase in the power of the test. A table with raised levels has been prepared up to sample sizes of 50, and a rule of thumb, which can be used if this table is not available, has been developed.
Abstract: Exact unconditional tests for comparing two binomial probabilities are generally more powerful than conditional tests like Fisher's exact test. Their power can be further increased by the Berger and Boos confidence interval method, where a p-value is found by restricting the common binomial probability under H0 to a 1 − γ confidence interval. We studied the average test power for the exact unconditional z-pooled test for a wide range of cases with balanced and unbalanced sample sizes, and significance levels 0.05 and 0.01. The detailed results are available online. Among the values 10^{−3}, 10^{−4}, …, 10^{−10}, the value γ = 10^{−4} gave the highest power, or close to the highest power, in all the cases we examined, and can be given as a general recommendation for an optimal γ.
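Boschloo's and Barnard's unconditional tests are now available in SciPy (`scipy.stats.boschloo_exact`, `scipy.stats.barnard_exact`), but the Berger–Boos refinement is not. The sketch below is a hedged, minimal implementation of the idea described above: maximize the null probability of a result at least as extreme (by the z-pooled statistic) over a 1 − γ Clopper–Pearson interval for the nuisance proportion, then add γ. The grid resolution is an arbitrary choice; the example counts (7/55 vs 2/55) are the expanded Renaud et al. figures tabulated further down this page.

```python
import numpy as np
from scipy.stats import beta, binom, fisher_exact

def z_pooled(a, n1, b, n2):
    """Pooled-variance z statistic for H1: p1 > p2 (0/0 cases set to 0)."""
    p1, p2 = a / n1, b / n2
    pp = (a + b) / (n1 + n2)
    denom = np.sqrt(pp * (1.0 - pp) * (1.0 / n1 + 1.0 / n2))
    return np.where(denom > 0, (p1 - p2) / np.where(denom > 0, denom, 1.0), 0.0)

def berger_boos_pvalue(a, n1, b, n2, gamma=1e-4, grid=1000):
    """Unconditional z-pooled p-value with the Berger-Boos correction:
    sup over a (1 - gamma) Clopper-Pearson interval for the common
    nuisance proportion, plus gamma."""
    z_obs = z_pooled(a, n1, b, n2)
    A = np.arange(n1 + 1)[:, None]          # all possible counts, group 1
    B = np.arange(n2 + 1)[None, :]          # all possible counts, group 2
    extreme = z_pooled(A, n1, B, n2) >= z_obs
    k, n = a + b, n1 + n2
    lo = beta.ppf(gamma / 2, k, n - k + 1) if k > 0 else 0.0
    hi = beta.ppf(1 - gamma / 2, k + 1, n - k) if k < n else 1.0
    sup = 0.0
    for pi in np.linspace(lo, hi, grid):    # grid search over the nuisance CI
        prob = binom.pmf(A, n1, pi) * binom.pmf(B, n2, pi)
        sup = max(sup, prob[extreme].sum())
    return min(1.0, sup + gamma)

a, n1, b, n2 = 7, 55, 2, 55   # expanded Renaud et al. counts (see tables below)
p_uncond = berger_boos_pvalue(a, n1, b, n2)
p_fisher = fisher_exact([[a, n1 - a], [b, n2 - b]], alternative="greater")[1]
```

For this table the unconditional p-value comes out well below the conditional Fisher value, which is the power gain the abstracts above describe; exact third-decimal agreement with the page's quoted figures is not claimed, since those may use different software and grid conventions.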
Being Known or Assumed to be "Gay" or "Lesbian": Examples of Negative Effects on Heterosexual and Sexual Minority Adolescents
Jackson PS, Peterson J (2004). Depressive disorder in highly gifted adolescents. The Journal of Secondary Gifted Education, 14(3): 175–186. Full Text.
"Another highly gifted boy, 13, described the onset of a major depressive episode following the death of his pet canary:I felt like something irreparable happened when my bird Merlin died. I was so sick, I mean physically sick, day after day. I had the flu constantly. I felt that I had the flu; when you think about that, it is “funny,” ironic. I had the flu and Merlin “flew” away. I mean, the stress of that made me sick. I made myself sick with that — I know that I did.
When it gets really bad, when the kids at school are calling me gay — can you tell me why they call me gay? I try to cope, I try to say “no troubles, just one more year here.” But, then something happens, and it happens very, very quickly.
I leave my body. I leave my mind. . . . I don’t know if I am here. And then I know that I am dying. I mean my actual brain is here, but my soul has left my body. I am screaming inside: “It is all so stupid, stupid, stupid!” I am dying here.
This young man’s physical manifestation of distress became so severe that he could no longer leave the house. He lost his appetite and became emaciated. However, a comprehensive series of physical exams revealed no irregularities:
These tests - I feel badly about the cost and bother to the medical system. I mean, there is nothing wrong with me; nothing physically wrong with me. It is my soul that is dying here. I keep telling them that is what is wrong, and they keep searching for clues in my body. That is not what the problem is."
Wodnik B, North S (1997). Not race, but who she loved, brought abuse on Everett teen. Full Text.
She's also a lesbian. And because of that she was attacked and her ankle broken on the way home from Everett High School. She was spit on and called "dyke." She had full cans of pop thrown at her head and once received a note signed by about a dozen fellow students saying they'd pay a million dollars to see her burned at the stake in the high school auditorium. Sometimes she lay in her bed at night and prayed to God to end her life.
Donaldson James, Susan (2009). When Words Can Kill: 'That's So Gay'. ABC News. Full Text.
Carl Joseph Walker-Hoover was 11 - hardly old enough to know his sexuality - and yet distraught enough to hang himself last week after school bullies repeatedly called him "gay."
Hanlon, Joleen (2009). A Tragic Lesson in Anti-Gay Bullying. Education Week. Full Text.
This 11-year-old [Carl Joseph Walker-Hoover] sent the world a powerful message, one demonstrating just how painful words can be. His death also provided educators with another illustration of the need to address homophobic attitudes in schools. The consequences of anti-gay bullying may be difficult at times to see, but they can forever alter, and sometimes end, the life of a child. It is time for educators to stop overlooking anti-gay language and start responding to it with the same vigor we would to the expression of racist attitudes.
The fact is that this type of hate language, used against lesbian, gay, bisexual, and transgender, or LGBT, youths, is common in American schools. Students who are LGBT - or are perceived to be - are frequently bullied. In fact, sexual orientation is, according to a 2005 nationwide survey, the second most common reason for repeated harassment in schools. "Words such as 'gay,' 'fag,' and 'queer' are often used as the most hurtful insults students can throw at one another."
Badash, David (2010). Bullied 14-Year-Old Called “Gay, Girly, Fag” Commits Suicide. News: The New Civil Rights Movement. Full Text.
Fitz, Timothy (2010). Billy Lucas: Teen Commits Suicide After Being Called Gay and Told to Kill Himself. Chicago News Report. Full Text.
Eckholm, Erik (2011). Eight Suicides in Two Years at Anoka-Hennepin School District. The New York Times. Full Text.
Brittany Geldert, 14, another plaintiff, has called herself bisexual since seventh grade and said she had repeatedly been called “dyke” while teachers looked the other way. Her grades plummeted, her poetry took a dark turn and she has been hospitalized for severe depression and suicidal thoughts.
Melloy, Kilian (2011). Was Straight Teen’s Death a Result of Homophobic Bullying? Edge, Boston. Full Text.
An Ohio teenager died from an apparently self-inflicted gunshot wound after suffering homophobic bullying at school, reported Akron newspaper the Beacon Journal in a story that Queerty.com picked up. The story related how Nicholas Kelo, Jr., 13, of Rittman, Ohio, suffered at school after fellow students assumed that he was gay for setting aside football and taking up band once he reached high school.
 (2012). Zachery Gray, [heterosexual-identified] Florida Teen, Suffers Brain Damage After Suicide Attempt Allegedly Due To Anti-Gay Bullying. Huffington Post. Full Text.
Long Island Gay, Lesbian, Bisexual and Transgender Network (2012). Current News. Full Text.
"It is with sad news to report that David Hernandez, a 16yearold East Hampton High School student, committed suicide over a week ago. Reports have surfaced that this suicide was allegedly tied to David being bullied and harassed because of the perception that he was gay. This is a difficult time for David’s family, classmates, friends, and the entire East Hampton community. Gay teen suicide is an epidemic and this most recent suicide hits close to home as it has happened right in our own backyard. We have been working closely with East Hampton school administration and officials since this tragedy to ensure that support is available for the healing that needs to take place, and to create a plan to move forward in ensuring no other teen, GLBT or heterosexual, feels the need to take their own life."
[Important Note: There may be an ignored epidemic of heterosexual-identified teen suicide associated with anti-gay bullying, as in being bullied/abused because they are assumed to be gay or lesbian. In the Seattle 1995 Youth Risk Behavior Survey, of the males reporting being targeted for anti-gay harassment and having attempted suicide, 7 were gay/bisexual identified, 7 were unsure of their sexual orientation, and 20 (the majority) were heterosexual identified. For males reporting being targeted for anti-gay harassment and having attempted suicide that required medical attention, 4 were gay/bisexual identified, 1 was unsure of his sexual orientation, and 8 (the majority) were heterosexual identified. Of the females reporting being targeted for anti-lesbian harassment and having attempted suicide, 16 were lesbian/bisexual identified, 10 were unsure of their sexual orientation, and 66 (the majority) were heterosexual identified. For females reporting being targeted for anti-lesbian harassment and having attempted suicide that required medical attention, 6 were lesbian/bisexual identified, 4 were unsure of their sexual orientation, and 27 (the majority) were heterosexual identified. To date, heterosexual-identified youth have been largely ignored in studies reporting on anti-gay/lesbian harassment, victimization, and violence.]
The American Academy of Pediatrics (2012). Cyberbullying Only Rarely the Sole Factor Identified in Teen Suicides. Full Text.
Most teen suicide victims are bullied both online and in school, and many suicide victims also suffer from depression... researchers searched the Internet for reports of youth suicides where cyberbullying was a reported factor. Information about demographics and the event itself were then collected through searches of online news media and social networks... The study identified 41 suicide cases (24 female, 17 male, ages 13 to 18) from the U.S., Canada, the United Kingdom and Australia. In the study, 24 percent of teens were the victims of homophobic bullying, including the 12 percent of teens identified as homosexual and another 12 percent of teens who were identified as heterosexual or of unknown sexual preference.
Robinson JP, Espelage DL (2012). Bullying Explains Only Part of LGBTQ–Heterosexual Risk Disparities: Implications for Policy and Practice. Educational Researcher, 41(8): 309–319. Abstract. Abstract Excerpt: Our sample consisted of 11,337 students in Grades 7 through 12 from 30 schools in Dane County, Wisconsin. Using both multilevel covariate-adjusted models and propensity-score-matching models, we found that although victimization does explain a portion of the LGBTQ–heterosexual risk disparities, substantial differences persist even when the differences in victimization are taken into account. For example, LGBTQ-identified students were 3.3 times as likely to think about suicide (p < .0001), 3.0 times as likely to attempt suicide (p = .007), and 1.4 times as likely to skip school (p = .047) as propensity-score-matched heterosexual-identified students within the same school who reported equivalent levels of peer victimization.
Takeuchi, Craig (2012). Canadian-led $2-million study to examine homophobic bullying in schools. Vancouver Free Press. Full Text. [Hainsworth, Jeremy (2012). UBC study to evaluate success of anti-homophobia programs. Xtra, February 21. Full Text.]
The study will not be limited to queer youth but will encompass how straight students are impacted as well. Contrary to misconceptions that only one particular demographic, namely queer youth, is affected, homophobia can affect and be used against all youth, whether straight or queer. Take the example of former North Vancouver high school student Azmi Jubran, who identifies as straight and who won a landmark case against the North Vancouver School District in 2005. He took the school district to the B.C. Human Rights Tribunal for failing to do anything about the homophobic bullying he was subjected to for five years. "In any high school, there are far more heterosexual teens than lesbian, gay, bisexual, or questioning teens, and because of this, we have found half or more of those targeted for anti-gay harassment actually identify as straight," UBC School of Nursing professor and principal investigator Elizabeth Saewyc stated in a news release. "There isn't much research about them, but what there is suggests they have the same health consequences as LGBTQ youth who are bullied."
Hill C, Kearl H (2011). Crossing The Line: Sexual Harassment at School. Washington, DC: American Association of University Women (AAUW). Full Text.
"Boys were most likely to cite being called gay in a negative way in person as their most negative experience of sexual harassment. Girls and boys were equally likely to experience this type of sexual harassment (18 percent of students surveyed), but 21 percent of boys and only 9 percent of girls identified being called gay or lesbian as their worst experience of sexual harassment." [National Survey, Grades 7 to 12]
Table AA: An Expanded Homosexuality Factor In Adolescent Suicide Combining: Those Deemed to be Homosexually Oriented, Plus Those Harassed/Abused* Because They Were Gender Nonconforming, or Likely Suspected to be Homosexual and Treated/Abused* Accordingly 

Study 
Homosexuality As Suicide Risk Factor: p-Values 
Study Information 
Shaffer et al. (1995) 
Fisher Test, One-Sided: p = 0.008
Fisher–Boschloo: p = 0.005
z-Pooled (2 Methods): p = 0.004, 0.003
Fisher Mid-P: p = 0.004
Barnard's Test: Real p-Value: p = 0.002; Estimated p-Value: p = 0.002
Arcsine Stat: p = 0.002
Adolescent Males & Females (See Above Table for Study Information)
3 Males - Deemed Homosexual - out of 95 Male Suicide Victims. 0 Females - Deemed Homosexual - out of 25 Female Suicide Victims. Total: 3 Homosexual Suicide Victims out of 120.
3 Additional Male & 0 Female Suicide Victims Known by Informants to have been 'Teased' for Gender Nonconforming Reasons. 0 Male & 0 Female Controls Known by Informants to have been 'Teased' for Gender Nonconforming Reasons. 0 Male & 0 Female Controls Deemed to be Homosexual.
Therefore: 6 Homosexuality-Related Suicide Victims.
For 2 × 2 Statistical Calculations: 6 Homosexuality-Related Suicide Victims & 114 Non-Homosexuality-Related Suicide Victims (6 / 120 = 5.0%) vs. 0 Homosexuality-Related Controls & 147 Non-Homosexuality-Related Controls (0 / 147 = 0.0%).
Odds Ratio Calculation = Not Possible
Renaud et al. (2010) 
Fisher Test, One-Sided: p = 0.081
Fisher–Boschloo: p = 0.054
z-Pooled (2 Methods): p = 0.048, 0.047
Fisher Mid-P: p = 0.048
Barnard's Test: Real p-Value: p = 0.044; Estimated p-Value: p = 0.047
Arcsine Stat: p = 0.047
Adolescent Males & Females (See Above Table for Study Information)
3 Males - Deemed Homosexual - out of 43 Male Suicide Victims. 1 Female - Deemed Homosexual - out of 12 Female Suicide Victims. Total: 4 Homosexual Suicide Victims out of 55.
2 Additional Male & 1 Female Suicide Victims Known by Informants to have been 'Teased' for Gender Nonconforming Reasons. 0 Male & 2 Female Controls Known by Informants to have been 'Teased' for Gender Nonconforming Reasons. 3 Male & 1 Female Controls Deemed to be Homosexual.
Therefore: 7 Homosexuality-Related Suicide Victims.
For 2 × 2 Statistical Calculations: 7 Homosexuality-Related Suicide Victims & 48 Non-Homosexuality-Related Suicide Victims (7 / 55 = 12.7%) vs. 2 Homosexuality-Related Controls & 53 Non-Homosexuality-Related Controls (2 / 55 = 3.6%).
OR = 3.9 (95% CI: 0.86 to 17.1)
Combining the Above Shaffer et al. (1995) & Renaud et al. (2010) Study Results 
Two Studies Meta-Analytically Combined: Males & Females


Shaffer et al. (1995) 
Fisher Test, One-Sided: p = 0.008
Fisher–Boschloo: p = 0.005
z-Pooled (2 Methods): p = 0.004, 0.004
Fisher Mid-P: p = 0.004
Barnard's Test: Real p-Value: p = 0.002; Estimated p-Value: p = 0.002
Arcsine Stat: p = 0.002
Adolescent Males Only (See Above Table for Study Information)
3 Males - Deemed Homosexual - out of 95 Male Suicide Victims.
3 Additional Male Suicide Victims Known by Informants to have been 'Teased' for Gender Nonconforming Reasons. 0 Male Controls Known by Informants to have been 'Teased' for Gender Nonconforming Reasons. 0 Male Controls Deemed to be Homosexual.
Therefore: 6 Homosexuality-Related Suicide Victims.
For 2 × 2 Statistical Calculations: 6 Male Homosexuality-Related Suicide Victims & 89 Male Non-Homosexuality-Related Suicide Victims (6 / 95 = 6.3%) vs. 0 Homosexuality-Related Controls & 116 Non-Homosexuality-Related Controls (0 / 116 = 0.0%).
Odds Ratio Calculation = Not Possible
Renaud et al. (2010) 
Fisher Test, One-Sided: p = 0.028
Fisher–Boschloo: p = 0.016
z-Pooled (2 Methods): p = 0.012, 0.012
Fisher Mid-P: p = 0.014
Barnard's Test: Real p-Value: p = 0.010; Estimated p-Value: p = 0.010
Arcsine Stat: p = 0.010
Adolescent Males Only (See Above Table for Study Information)
3 Males - Deemed Homosexual - out of 43 Male Suicide Victims.
2 Additional Male Suicide Victims Known by Informants to have been 'Teased' for Gender Nonconforming Reasons. 0 Male Controls Known by Informants to have been 'Teased' for Gender Nonconforming Reasons. 0 Male Controls Deemed to be Homosexual.
Therefore: 5 Homosexuality-Related Suicide Victims.
For 2 × 2 Statistical Calculations: 5 Male Homosexuality-Related Suicide Victims & 38 Male Non-Homosexuality-Related Suicide Victims (5 / 43 = 11.6%) vs. 0 Homosexuality-Related Controls & 43 Non-Homosexuality-Related Controls (0 / 43 = 0.0%).
Odds Ratio Calculation = Not Possible
Combining the Above Shaffer et al. (1995) & Renaud et al. (2010) Study Results - Two Studies Meta-Analytically Combined: Males
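Several of the conditional p-values and the Renaud et al. point odds ratio quoted in the tables above can be reproduced directly from the 2 × 2 counts via the hypergeometric distribution. This sketch checks only the one-sided Fisher values, the Fisher mid-P values, and the point OR; the unconditional (Boschloo, z-pooled, Barnard) and arcsine entries require the methods described earlier on this page, and the quoted CI of 0.86 to 17.1 appears to come from an exact method not reproduced here. The same recipe applies to the males-only rows.

```python
from scipy.stats import hypergeom

def fisher_and_midp(a, b, c, d):
    """One-sided Fisher p-value P(X >= a) and Lancaster mid-P for the 2x2
    table [[a, b], [c, d]] (rows = suicide victims / living controls)."""
    N, K, n = a + b + c + d, a + c, a + b   # total, total 'exposed', row-1 size
    p_fisher = hypergeom.sf(a - 1, N, K, n)
    p_mid = hypergeom.sf(a, N, K, n) + 0.5 * hypergeom.pmf(a, N, K, n)
    return p_fisher, p_mid

# Expanded Shaffer et al. (1995): 6/120 victims vs 0/147 controls
p_f_shaffer, p_m_shaffer = fisher_and_midp(6, 114, 0, 147)

# Expanded Renaud et al. (2010): 7/55 victims vs 2/55 controls
p_f_renaud, p_m_renaud = fisher_and_midp(7, 48, 2, 53)
or_renaud = (7 * 53) / (48 * 2)   # cross-product point odds ratio

print(round(p_f_shaffer, 3), round(p_m_shaffer, 3))   # 0.008 0.004
print(round(p_f_renaud, 3), round(p_m_renaud, 3))     # 0.081 0.048
print(round(or_renaud, 1))                            # 3.9
```

These match the table entries "Fisher Test, One-Sided," "Fisher Mid-P," and "OR = 3.9" above.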



"I hate being the only open gay guy in my school… It f***ing sucks, I really want to end it. Like all of it, I not getting better theres 3 more years of high school left, Iv been on 4 different anti depressants, none of them worked. I’v been depressed since january, How f***ing long is this going to last. People said “It gets better”. Its f***ing bull****. I go to see psychologist, What the f*** are they suppost to f***ing do? All I do is talk about problems, it doesnt make them dissapear?? I give up."" (Jamie Hubley, Gay 15YearOld Ottawa, Canada Teen Commits Suicide, Cites Depression, School Troubles, October 17, 2011: http://www.huffingtonpost.com/2011/10/17/jamiehubleycommitssuicide_n_1015646.html.Jamie, however, had been bullied long before he decided to become openly gay in grade 10, and he was bullied based on the 'gender nonconforming based' suspicion that he was gay: "There are some reports in the media and on social media that James was bullied. This is true. We were aware of several occasions when he felt he was being bullied. In Grade 7 he was treated very cruelly simply because he liked figure skating over hockey." (Father says ‘bullying was definitely a factor’ in son Jamie Hubley’s suicide, by Darryl Morris, October 17, 2011: http://www.lgbtqnation.com/2011/10/fathersaysbullyingwasdefinitelyafactorinsonjamiehubleyssuicide/. It should be noted that, if Jamie had ended his life via suicide before he outed himself to others, the best informers could have said was that he was "cruelly" bullied for gender nonconforming reasons, such as being called a "sissy"  more like greatly abused for being deemed a "sissy"  and likely also "gay"  because he liked figure skating. Most informers, in cases involving the suicides of such boys, might have said that the boys were "teased" for being a sissy, that would be an monumental understatement of the "terrorism" they likely experienced.
Table 1aa: Proportion difference for sexual minority males and females in the suicide and living control groups. See: Related Information. 
Table 2aa: Proportion difference for sexual minority males in the suicide and living control groups. See: Related Information. 
Abstract: Recently, meta-analysis has been widely utilized to combine information across comparative clinical studies for evaluating drug efficacy or safety profile. When dealing with rather rare events, a substantial proportion of studies may not have any events of interest. Conventional methods either exclude such studies or add an arbitrary positive value to each cell of the corresponding 2×2 tables in the analysis. In this article, we present a simple, effective procedure to make valid inferences about the parameter of interest with all available data, without artificial continuity corrections. We then use the procedure to analyze the data from 48 comparative trials involving rosiglitazone with respect to its possible cardiovascular toxicity.
Table 1b - Arcsine Difference Meta-Analysis*: Males & Females - The Homosexuality Factor in Adolescent Suicide** 






Table 2b - Arcsine Difference Meta-Analysis*: Males - The Homosexuality Factor in Adolescent Suicide** 






Table 1c - Odds Ratio Mantel–Haenszel Meta-Analysis: Males & Females - Using "0.5" Continuity Correction* - The Homosexuality Factor in Adolescent Suicide** 




Table 1d - Odds Ratio Mantel–Haenszel Meta-Analysis: Males & Females - Using "TAC" Continuity Correction* - The Homosexuality Factor in Adolescent Suicide** 




Table 2c - Odds Ratio Mantel–Haenszel Meta-Analysis: Males - Using "0.5" Continuity Correction* - The Homosexuality Factor in Adolescent Suicide** 




Table 2d - Odds Ratio Mantel–Haenszel Meta-Analysis: Males - Using "TAC" Continuity Correction* - The Homosexuality Factor in Adolescent Suicide** 




Table 1e - Peto Method Odds Ratios & Meta-Analysis: Males & Females^1 - The Homosexuality Factor in Adolescent Suicide* 




Table 2e - Peto Method Odds Ratios & Meta-Analysis: Males^1 - The Homosexuality Factor in Adolescent Suicide** 




Abstract: For clinical trials with binary endpoints there are a variety of effect measures, for example risk difference, risk ratio and odds ratio (OR). The choice of metric is not always straightforward and should reflect the clinical question. Additional issues arise if the event of interest is rare. In systematic reviews, trials with zero events in both arms are encountered and often excluded from the meta-analysis. The arcsine difference (AS) is a measure which is rarely considered in the medical literature. It appears to have considerable promise, because it handles zeros naturally, and its asymptotic variance does not depend on the event probability. This paper investigates the pros and cons of using the AS as a measure of intervention effect. We give a pictorial representation of its meaning and explore its properties in relation to other measures. Based on analytical calculation of the variance of the arcsine transformation, a more conservative variance estimate for the rare event setting is proposed. Motivated by a published meta-analysis in cardiac surgery, we examine the statistical properties of the various metrics in the rare event setting. We find the variance estimate of the AS to be more stable than that of the log-OR, even if events are rare. However, parameter estimation is biased if the groups are markedly unbalanced. Though, from a theoretical viewpoint, the AS is a natural choice, its practical use is likely to continue to be limited by its less direct interpretation.
"The arcsine transformation was introduced in the statistical literature for its approximative variancestabilizing property. The key advantage is that a stabilized variance also leads to more robust estimation. If the risks in the treatment arms are estimated with noise, the variance estimate of the AS is less dramatically changed than that of the logOR, even if events are rare. This is an advantage of the AS as a measure of treatment effect particularly when zero cell studies occur. A disadvantage is that if events are rare in both groups and the groups sizes are markedly unbalanced, bias will be induced by the transformation. In this situation, though, other methods are likewise prone to bias. (p. 735)
Abstract: In meta-analyses, it sometimes happens that smaller trials show different, often larger, treatment effects. One possible reason for such 'small study effects' is publication bias. This is said to occur when the chance of a smaller study being published is increased if it shows a stronger effect. Assuming no other small study effects, under the null hypothesis of no publication bias, there should be no association between effect size and effect precision (e.g. inverse standard error) among the trials in a meta-analysis. A number of tests for small study effects/publication bias have been developed. These use either a nonparametric test or a regression test for association between effect size and precision. However, when the outcome is binary, the effect is summarized by the log risk ratio or log odds ratio (log OR). Unfortunately, these measures are not independent of their estimated standard error. Consequently, established tests reject the null hypothesis too frequently. We propose new tests based on the arcsine transformation, which stabilizes the variance of binomial random variables. We report results of a simulation study under the Copas model (on the log OR scale) for publication bias, which evaluates tests so far proposed in the literature. This shows that: (i) the size of one of the new tests is comparable to those of the best existing tests, including those recently published; and (ii) among such tests it has slightly greater power, especially when the effect size is small and heterogeneity is present. Arcsine tests have additional advantages in that they can include trials with zero events in both arms and that they can be very easily performed using the existing software for regression tests.
"The arcsine difference has a long history, dating back to the 1940s [55, 59, 117, 118, 119], and is often used in other contexts [58, 120, 121], but not, to our knowledge, as a measure of treatment effect in clinical trials. It is nevertheless briefly mentioned in this context in a series of references [16, 23, 56, 122, 123, 124]. Its attraction is that the arcsine transformation is the asymptotically variance-stabilising transformation for the binomial distribution." (p. 78) "Transforming the binomial risk introduces bias, which will be greater for small sample sizes and rare events." (p. 86)
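The arcsine difference, Cohen's effect size h (generated via the arcsine transformation, as in this page's companion material), and the variance-stabilised z statistic are all one-liners. A hedged sketch, using the expanded Shaffer et al. proportions from the tables above; note that this page's tabulated "Arcsine Stat" p-values were presumably obtained by a different route, so no attempt is made to reproduce them here:

```python
import math

def arcsine_diff(p1, p2):
    """Arcsine difference AS = asin(sqrt(p1)) - asin(sqrt(p2)), in radians."""
    return math.asin(math.sqrt(p1)) - math.asin(math.sqrt(p2))

def cohens_h(p1, p2):
    """Cohen's effect size h = 2 * AS."""
    return 2.0 * arcsine_diff(p1, p2)

def arcsine_z(p1, n1, p2, n2):
    """z statistic using the stabilised variance 1/(4*n1) + 1/(4*n2),
    which does not depend on the event probability (the property the
    excerpts above highlight, and why zero cells are handled naturally)."""
    return arcsine_diff(p1, p2) / math.sqrt(1.0 / (4 * n1) + 1.0 / (4 * n2))

# Expanded Shaffer et al. (1995): 6/120 victims vs 0/147 controls
h = cohens_h(6 / 120, 0 / 147)      # ≈ 0.45, a small-to-medium effect
z = arcsine_z(6 / 120, 120, 0 / 147, 147)
```

Because the variance term involves only the sample sizes, the statistic is defined even when one arm has zero events, unlike the log-OR.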
From Section "16.9.5 Validity of methods of metaanalysis for rare events" (Full Text):
"Bradburn et al. found that many of the most commonly used metaanalytical methods were biased when events were rare (Bradburn 2007). The bias was greatest in inverse variance and DerSimonian and Laird odds ratio and risk difference methods, and the MantelHaenszel odds ratio method using a 0.5 zerocell correction. As already noted, risk difference metaanalytical methods tended to show conservative confidence interval coverage and low statistical power when risks of events were low.
At event rates below 1% the Peto one-step odds ratio method was found to be the least biased and most powerful method, and provided the best confidence interval coverage, provided there was no substantial imbalance between treatment and control group sizes within studies, and treatment effects were not exceptionally large. This finding was consistently observed across three different meta-analytical scenarios, and was also observed by Sweeting et al. (Sweeting 2004)...
Methods that should be avoided with rare events are the inverse-variance methods (including the DerSimonian and Laird random-effects method). These directly incorporate the study’s variance in the estimation of its contribution to the meta-analysis, but these are usually based on a large-sample variance approximation, which was not intended for use with rare events. The DerSimonian and Laird method is the only random-effects method commonly available in meta-analytic software. We would suggest that incorporation of heterogeneity into an estimate of a treatment effect should be a secondary consideration when attempting to produce estimates of effects from sparse data – the primary concern is to discern whether there is any signal of an effect in the data."
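To make the zero-cell issue concrete, here is a hedged sketch of a Mantel–Haenszel pooled odds ratio over the two expanded study tables from this page, with the 0.5 constant correction applied only to the study containing a zero cell (one common convention; whether the page's Tables 1c/2c used exactly these conventions is not verified here):

```python
def mh_pooled_or(tables, cc=0.5):
    """Mantel-Haenszel pooled OR over 2x2 tables (a, b, c, d).
    A constant correction `cc` is added to every cell of any table
    containing a zero, so that its odds ratio is defined."""
    num = den = 0.0
    for a, b, c, d in tables:
        if 0 in (a, b, c, d):
            a, b, c, d = a + cc, b + cc, c + cc, d + cc
        n = a + b + c + d
        num += a * d / n      # MH numerator weight
        den += b * c / n      # MH denominator weight
    return num / den

tables = [(6, 114, 0, 147),   # expanded Shaffer et al. counts (zero cell)
          (7, 48, 2, 53)]     # expanded Renaud et al. counts
pooled = mh_pooled_or(tables)
```

The choice of correction matters precisely because, as the excerpt above notes, the MH method with a 0.5 zero-cell correction was among the most biased approaches in the rare-event simulations.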
From Section "9.4.4.2 Peto odds ratio method" (Full Text):
Peto’s method (Yusuf 1985) can only be used to pool odds ratios. It uses an inverse variance approach but utilizes an approximate method of estimating the log odds ratio, and uses different weights. An alternative way of viewing the Peto method is as a sum of ‘O – E’ statistics. Here, O is the observed number of events and E is an expected number of events in the experimental intervention group of each study. The approximation used in the computation of the log odds ratio works well when intervention effects are small (odds ratios are close to one), events are not particularly common and the studies have similar numbers in experimental and control groups. In other situations it has been shown to give biased answers. As these criteria are not always fulfilled, Peto’s method is not recommended as a default approach for metaanalysis. Corrections for zero cell counts are not necessary when using Peto’s method. Perhaps for this reason, this method performs well when events are very rare (Bradburn 2007) (see Chapter 16, Section 16.9).
Cochrane Collaboration’s Open Learning Material for Cochrane reviewers  From Section "Combining studies: Weighted Averages" (Full Text):
The Peto method: The Peto method works for odds ratios only. Focus is placed on the observed number of events in the experimental intervention. We call this O for 'observed' number of events, and compare this with E, the 'expected' number of events. Hence an alternative name for this method is the 'O − E' method. The expected number is calculated using the overall event rate in both the experimental and control groups. Because of the way the Peto method calculates odds ratios, it is appropriate when trials have roughly equal numbers of participants in each group and treatment effects are small. Indeed, it was developed for use in mega-trials in cancer and heart disease where small effects are likely, yet very important. The Peto method is better than the other approaches at estimating odds ratios when there are lots of trials with no events in one or both arms. It is the best method to use with rare outcomes of this type. The Peto method is generally less useful in Cochrane reviews, where trials are often small and some treatment effects may be large.
Yusuf S, Peto R, Lewis J, Collins R, Sleight P (1985). Beta blockade during and after myocardial infarction: an overview of the randomized trials. Progress in Cardiovascular Diseases, 27(5): 335–71. Abstract.
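The 'O − E' computation described above can be written down in a few lines. This is a hedged illustration applied to the two expanded study tables from this page; whether the page's Tables 1e/2e used exactly these inputs and conventions is not verified here. Note that no zero-cell correction is needed, which is precisely the Peto method's appeal for rare events.

```python
import math

def peto_study(a, n1, b, n2):
    """Per-study Peto quantities (O - E, V) for `a` events among n1 subjects
    in the first group and `b` events among n2 in the second."""
    d, N = a + b, n1 + n2
    O = a                      # observed events in the first group
    E = n1 * d / N             # expected events under the null
    # hypergeometric variance of the first group's event count
    V = n1 * n2 * d * (N - d) / (N ** 2 * (N - 1))
    return O - E, V

def peto_pooled_or(studies):
    """Pooled Peto odds ratio: exp( sum(O - E) / sum(V) )."""
    oe = sum(s[0] for s in studies)
    v = sum(s[1] for s in studies)
    return math.exp(oe / v)

studies = [peto_study(6, 120, 0, 147),   # expanded Shaffer et al. counts
           peto_study(7, 55, 2, 55)]     # expanded Renaud et al. counts
pooled = peto_pooled_or(studies)
```

Per the Cochrane guidance quoted above, this approximation is trustworthy mainly when effects are modest and group sizes are balanced; with a zero cell and a large effect, as in the Shaffer table, the per-study Peto OR should be read cautiously.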
Types of Continuity Correction:
• Constant k - typically 0.5
• 'Treatment arm' continuity correction (based on the reciprocal of the opposite group size; causes less bias when encountering severely imbalanced groups)
• Empirical continuity correction (based on an empirical estimate of the pooled effect size using the non-zero-event studies) (Sweeting et al., 2004)
Commonly Used Statistical Methods:
• Fixed-effect models (inverse variance-weighted (IV) method; Mantel–Haenszel (MH) method; Peto method; logistic regression; Bayesian method)
• Random-effects models (DerSimonian & Laird (DL) method; Bayesian method)
Comparisons of Commonly Used Existing Methods
• The choice of method in a sparse-event meta-analysis is important, since certain methods perform poorly, especially when group imbalances exist (bias is greatest using the IV and DL methods, and the MH method with a CC of 0.5)
• The MH method using the alternative CC provides the least biased results for all group imbalances
• At event rates below 1%, the Peto method provides the least biased, most powerful results and the best CI coverage for balanced groups, but bias increases with greater group imbalance and larger treatment effect
• Logistic regression performs well and is generally unbiased and reliable
• The Bayesian fixed-effect model performs consistently well irrespective of group imbalance
• Alternative CCs perform better than a constant CC
(Sweeting et al., 2004; Bradburn et al., 2007)
A New Advance: Exact Inference Procedure
• A new method using an exact inference procedure was proposed by Tian et al. (2009): it combines across-trial information without excluding double-zero studies or applying a continuity correction, and provides exact inference without relying on large-sample approximations
• The risk difference (RD) with an associated exact CI can be constructed using the procedure (the CI can be overconservative in some cases)
We estimated the pooled odds ratio as our measure of effect size by using fixed-effects (for example, Mantel–Haenszel) and random-effects (DerSimonian–Laird) models (8). When applicable, we used methods with or without 2 continuity corrections. One is a constant correction (CC) that adds values of 0.5 to all cells of the 2 × 2 contingency table of the study selected for correction. The other is a treatment arm correction (TAC) that adds values proportional to the reciprocal of the size of the opposite treatment group. (See the Appendix for details.) ... Appendix: This Appendix describes the continuity corrections applied to myocardial infarction data for trial number 49653/04 in Nissen and Wolski's meta-analysis (1). The CC for continuity adds 0.5 to each cell of the 2 × 2 contingency table, effectively increasing the treatment and control group sizes by 1 and the total study sample size by 2 (from 348 uncorrected to 350 corrected). The TAC for continuity adds a value proportional to the reciprocal of the size of the opposite treatment group, normalized to a sum of 1 for the event and no-event cells, resulting in an increase in the total study sample size by 2 (identical to that in the CC for continuity). With R being the ratio of group sizes and S being the sum of corrections for the event and no-event cells, the TAC for continuity adds a factor of R/[S(R + 1)] to the larger group and 1/[S(R + 1)] to the other group. In the example shown (Appendix Table), S is set to 1 and R is 232/116 = 2. The correction in the (larger) treatment group becomes 2/[1 × (2 + 1)] = 2/3 ≈ 0.67, and that in the (smaller) control group becomes 1/[1 × (2 + 1)] = 1/3 ≈ 0.33.
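The appendix's arithmetic can be checked directly. The sketch below assumes only the quoted formulas (R = ratio of group sizes, S = sum of corrections), applied to the 232-versus-116 example:

```python
def treatment_arm_corrections(n_large, n_small, s=1.0):
    """Treatment-arm continuity corrections from the quoted formulas:
    with R the ratio of group sizes and S the sum of corrections,
    the larger group gets R / (S * (R + 1)) and the smaller group
    gets 1 / (S * (R + 1))."""
    r = n_large / n_small
    return r / (s * (r + 1)), 1 / (s * (r + 1))

# The appendix example: treatment n = 232, control n = 116, so R = 2.
k_large, k_small = treatment_arm_corrections(232, 116)
print(round(k_large, 2), round(k_small, 2))  # 0.67 0.33
```

Note that the two corrections sum to S = 1, so the total study sample size still grows by 2 (one unit per arm across the event and no-event cells), as the appendix states.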
"Fixed effect: The fixed effect model assumes that all studies in the meta-analysis share a common true effect size. Put another way, all factors which could influence the effect size are the same in all the study populations, and therefore the effect size is the same in all the study populations. It follows that the observed effect size varies from one study to the next only because of the random error inherent in each study.
Random effects: By contrast, the random effects model assumes that the studies were drawn from populations that differ from each other in ways that could impact on the treatment effect. For example, the intensity of the intervention or the age of the subjects may have varied from one study to the next. It follows that the effect size will vary from one study to the next for two reasons. The first is random error within studies, as in the fixed effect model. The second is true variation in effect size from one study to the next."
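The contrast between the two models can be made concrete. The sketch below pools hypothetical log odds ratios with an inverse-variance fixed-effect estimate and a DerSimonian–Laird random-effects estimate; the effect sizes and variances are invented for illustration and are not data from any study cited here.

```python
def fixed_effect(effects, variances):
    """Inverse-variance fixed-effect pooled estimate."""
    weights = [1 / v for v in variances]
    return sum(w * e for w, e in zip(weights, effects)) / sum(weights)

def dersimonian_laird(effects, variances):
    """Random-effects pooled estimate: estimate the between-study
    variance tau^2 (DerSimonian-Laird, truncated at zero) and add it
    to each study's within-study variance before re-weighting."""
    w = [1 / v for v in variances]
    mu_fe = sum(wi * e for wi, e in zip(w, effects)) / sum(w)
    q = sum(wi * (e - mu_fe) ** 2 for wi, e in zip(w, effects))  # heterogeneity statistic
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(effects) - 1)) / c)
    w_re = [1 / (v + tau2) for v in variances]
    return sum(wi * e for wi, e in zip(w_re, effects)) / sum(w_re)

# Invented log odds ratios and variances for three studies.
log_ors = [0.4, 0.9, 0.1]
variances = [0.05, 0.10, 0.08]
print(round(fixed_effect(log_ors, variances), 3))       # 0.429
print(round(dersimonian_laird(log_ors, variances), 3))  # 0.443
```

Because tau² is added to every study's variance, the random-effects weights are more nearly equal, so the discrepant second study pulls the pooled estimate slightly further from the fixed-effect value.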
Abstract: There are two popular statistical models for meta-analysis, the fixed-effect model and the random-effects model. The fact that these two models employ similar sets of formulas to compute statistics, and sometimes yield similar estimates for the various parameters, may lead people to believe that the models are interchangeable. In fact, though, the models represent fundamentally different assumptions about the data. The selection of the appropriate model is important to ensure that the various statistics are estimated correctly. Additionally, and more fundamentally, the model serves to place the analysis in context. It provides a framework for the goals of the analysis as well as for the interpretation of the statistics. In this paper we explain the key assumptions of each model, and then outline the differences between the models. We conclude with a discussion of factors to consider when choosing between the two models.
Conclusion: In summary, many of the commonly used methods for meta-analysis give inappropriate answers when data are sparse. The choice of the most appropriate method depends on the anticipated background event rate and structure of the trials. No method gives completely unbiased estimates in any circumstance when events are rare. At event rates below 1 per cent the Peto one-step odds ratio method appears to be the least biased and most powerful method, and provides the best confidence interval coverage, provided there is no substantial imbalance in treatment and control group sizes within trials, and treatment effects are not exceptionally large. In other circumstances the MH OR without zero-cell corrections, logistic regression and the exact method perform similarly to each other, and are less biased than the Peto method. (p. 75)
Abstract: In this article, the authors outline methods for using fixed and random effects power analysis in the context of meta-analysis... The authors also show how the typically uninformative retrospective power analysis can be made more informative. The authors then discuss the value of confidence intervals, show how they could be used in addition to or instead of retrospective power analysis, and also demonstrate that confidence intervals can convey information more effectively in some situations than power analyses alone. Finally, the authors take up the question "How many studies do you need to do a meta-analysis?" and show that, given the need for a conclusion, the answer is "two studies," because all other synthesis techniques are less transparent and/or are less likely to be valid. For systematic reviewers who choose not to conduct a quantitative synthesis, the authors provide suggestions for both highlighting the current limitations in the research base and for displaying the characteristics and results of studies that were found to meet inclusion criteria.
Excerpt: "Meta-analysis, however, provides a method for taking advantage of the relevant information comprising the statistical significance tests in the studies (i.e., effect sizes and their precision), avoids the problems associated with using the statistical conclusions arising from individual tests, and does so in a transparent and replicable way. In this sense, the answer to the question 'How many studies do you need to do a meta-analysis?' is 'two.' Not because it is ideal but rather because given the need for a conclusion (e.g., an administrator who needs to pick a program), it is a better analysis strategy than the alternatives. When is it legitimate not to synthesize studies? We do believe that there are times in which summarizing the results of multiple studies is not appropriate. For example, Cooper (2003) suggests that a meta-analysis of two studies will likely only be informative if the studies are direct (or 'statistical') replications of one another. The combination of very few studies with very different characteristics makes any kind of synthesis untenable in most cases." (p. 241)
Sexual Minority Demographics: 17 High Schools in Three Canadian Cities: Toronto, Kingston, and Montreal *

Category                       Heterosexual        Gay / Lesbian     Bisexual         Questioning
N (% of total; N = 3,636)      3,506 (96.4%)       12 (0.33%)        50 (1.37%)       68 (1.87%)
Sex                            M         F         M       F        M       F        M        F
n (Males = 1,708;              1,648 **  1,858 **  9       3        15      35       36       32
   Females = 1,928)
Percentage in category         47%       53%       75%     25%      30%     70%      52.9%    47.1%
Percentage in sex category     96.48%    96.37%    0.52%   0.15%    0.88%   1.81%    2.11%    1.66%

Gay / Lesbian & Bisexual %: Gay / Bisexual males = 0.52% + 0.88% = 1.40%; Lesbian / Bisexual females = 0.15% + 1.81% = 1.96%

Data source: Williams et al. (2003)

Montreal High Schools Survey: 2004 *

Category                       100% Heterosexual   Hetero. ID, Some Homo.   Gay / Lesbian **   Bisexual        ID Unsure
N (Total = 1,856)              1,624 (87.5%)       115 (6.20%)              7 (0.37%)          51 (2.75%)      59 (3.18%)
Sex                            M        F          M        F               M        F         M       F       M       F
n (Males = 941;                867      757        30       85              2        5         18      33      24      35
   Females = 915)
Percentage in category         53.4%    46.6%      26.1%    73.9%           28.6%    71.4%     35.3%   64.7%   40.7%   59.3%
Percentage in sex category     91.1%    82.7%      3.2%     9.3%            0.21%    0.55%     1.9%    3.6%    2.5%    3.8%

Gay / Lesbian & Bisexual %: Gay / Bisexual males = 0.21% + 1.9% = 2.11%; Lesbian / Bisexual females = 0.55% + 3.6% = 4.15%

Data source: Zhao et al. (2010). * Anonymous pencil-and-paper survey of 1,856 students 14 years of age and older from 14 public and private high schools in Montréal, Québec.
North American National Adult Surveys

Survey / Study                 Sex, n *      Heterosexual       Gay or Lesbian    Bisexual       Unsure

National Epidemiologic Survey on Alcohol and Related Conditions, Wave 2 (2004/05, USA):
Bolton & Sareen (2011)         M, 14,481     14,109 (97.4%)     190 (1.31%)       81 (0.56%)     101 (0.70%)
Bolton & Sareen (2011)         F, 19,896     19,489 (97.95%)    145 (0.73%)       161 (0.81%)    101 (0.51%)
Gay / Lesbian & Bisexual %: Gay / Bisexual males = 1.31% + 0.56% = 1.87%; Lesbian / Bisexual females = 0.73% + 0.81% = 1.54%

Canadian Community Health Survey, Cycle 2.1 (2003):
Brennan et al. (2010)          M, 49,901     49,065 (98.3%)     536 (1.07%)       300 (0.60%)    –
Steele et al. (2009)           F, 61,715     60,937 (98.74%)    354 (0.57%)       424 (0.69%)    –
Gay / Lesbian & Bisexual %: Gay / Bisexual males = 1.07% + 0.60% = 1.67%; Lesbian / Bisexual females = 0.57% + 0.69% = 1.26%

Gay / Lesbian & Bisexual %, averaged across the two surveys:
Gay / Bisexual males: (1.67% + 1.87%) / 2 = 1.77%
Lesbian / Bisexual females: (1.26% + 1.54%) / 2 = 1.40%

* M: Males; F: Females
Bailey JM, Zucker KJ (1995). Childhood sex-typed behavior and sexual orientation: A conceptual analysis and quantitative review. Developmental Psychology, 31(1): 43–55. Abstract.
Renaud J, Berlim MT, Begolli M, McGirr A, Turecki G (2010). Sexual orientation and gender identity in youth suicide victims: an exploratory study. Canadian Journal of Psychiatry, 55(1): 29–34. PubMed Abstract. Full Text.
"Problems due to small event rates were avoided because of use of methods for exact odds ratios for counts less than or equal to 3. Because of this modification, the performance of the methods may be slightly different from that presented previously in the literature." (p. 102)
One way to make valid statistical inferences in the presence of small, sparse or unbalanced data is to compute exact p-values and confidence intervals, based on the permutational distribution of the test statistic. (p. 1)
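The Tian et al. procedure itself is more involved, but the idea of exact inference on a sparse 2 × 2 table can be illustrated with a Fisher exact test, which sums hypergeometric probabilities directly rather than relying on a large-sample approximation. This is a minimal sketch, and the table used is hypothetical.

```python
from math import comb

def fisher_exact_p(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]:
    sum every hypergeometric table probability that does not exceed
    the probability of the observed table."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2
    denom = comb(n, col1)
    def prob(x):                      # P(first cell = x) given fixed margins
        return comb(row1, x) * comb(row2, col1 - x) / denom
    lo, hi = max(0, col1 - row2), min(col1, row1)
    p_obs = prob(a)
    return sum(p for p in (prob(x) for x in range(lo, hi + 1)) if p <= p_obs + 1e-12)

# A sparse hypothetical table: [[1, 9], [11, 3]].
print(round(fisher_exact_p(1, 9, 11, 3), 4))  # 0.0028
```

Because every possible table consistent with the margins is enumerated, the p-value is exact for any cell counts, including zeros, which is what makes this style of inference attractive for sparse data.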
Recommendation 2.17: Researchers must continue to move beyond a sole reliance on statistical significance in interpreting quantitative research in suicidology to address issues of the clinical and practical usefulness of their results. (p. 22)
In Chapter 2: General Methodological Issues.
"Suicidologists have had a great difficulty in identifying meaningful correlates and predictors of suicidal behavior. Because of this Neuringer and Kolstoe (1966) suggested adopting less stringent criteria for statistical significance in suicide research, perhaps allowing rejection of the null hypothesis at the 10% level instead of the 5% level. This is an intriguing idea which has never been followed up, but it would result in the appearance of a larger proportion of "significant" results that were never replicated." (p. 21)
"This study demonstrated that mothers in the community are mostly unaware of the suicide ideation and attempts of their adolescents and hardly recognize their emotional and behavioral difficulties."