
Adolescent Gay / Lesbian Suicide Risk

Supplemental Information for the 2013 Paper:
"Suicide Risk and Sexual Orientation: A Critical Review."
Archives of Sexual Behavior, 42(5): 715-727.

Online First (Feb. 26, 2013)
Authors: Plöderl M, Wagenmakers EJ, Tremblay P, Ramsay R, Kralovec K, Fartacek C, Fartacek R.
ResearchGate Full Text. - Draft of Paper. - Dr. Martin Plöderl Website.

PDF Download: Basic Information Supplement Submitted for Peer Review.
Most of the Information is Reproduced & Expanded Upon in This Section.


A Critical Examination of the Shaffer et al. (1995) & Renaud et al. (2010)
Psychological Autopsy Studies of Adolescent Suicides &
An Expanded Homosexuality Factor
in Adolescent Suicide


Special Section: The 2013 Paper, "Suicide Risk and Sexual Orientation: A Critical Review," Reverses the Conclusions of Two Previously Published Papers. The Re-Analysis - Including Many Meta-Analyses & Using Unconditional Tests for Statistical Significance - Indicates that "Gay/Lesbian/Bisexual Adolescents Are at Risk for Suicide." (This Page) - In Addition, Expanding the "At Risk" Category to Include Adolescents Known to Only Have Been Harassed/Abused - Because They Were Assumed to be Gay/Lesbian - Produces More Conclusive Results, Especially for Males. This Category Represents "An Expanded Homosexuality Factor in Adolescent Suicide." (This Page) - Associated Pages: Constructing "The Gay Youth Suicide Myth": Thirty Years of Resisting a Likely Truth & Generating Cohen's Effect Size "h" Via Arcsin / Arcsine Transformations.


The Related 2013 American Association of Suicidology Conference Card Handout:


This Webpage was developed by Pierre Tremblay, Martin Plöderl and Richard Ramsay.

  • Note: This webpage presents multiple meta-analytic combinations of results from two psychological autopsy studies of adolescent suicide deaths. Complications were generally related to the presence of "zero events" in one cell of each study, which required special considerations and analytical techniques available in the statistics literature. Some readers may be surprised that only two studies are in the meta-analyses, but the methodology used is the best available for combining two studies and is said to be acceptable especially when the second study is a replication of the first, as applies here (Valentine et al., 2010). Issues related to appropriate Null Hypothesis Testing for "Low Counts" / "Rare Events" studies are presented and related problems are addressed.

Contents

  • Had the authors of the two studies used more appropriate null hypothesis tests, they might have concluded that sexual minority adolescents were likely more at risk for dying by suicide. Instead, they concluded that such a risk difference did not exist. After exploring these issues, it is shown, using more advanced statistical methods - Bayesian analyses, meta-analysis, and more powerful significance tests - that the higher suicide risk for sexual minority adolescents most likely exists.
  • Introduction: The Basics in the Shaffer et al. (1995) & Renaud et al. (2010) studies and Related Issues: Null Hypothesis Testing Misunderstandings / Problems & Ignoring potentially important observed differences.

  • The Shaffer et al. (1995) & Renaud et al. (2010) psychological autopsy studies of adolescent suicides only explored the suicide results for adolescents deemed to be homosexually oriented. The evidence suggests, however, that adolescents not known to be homosexually oriented - but targeted for harassment based on other adolescents' beliefs that they were gay or lesbian (related information was solicited in both studies) - would likely also be at greater risk for dying by suicide. Hence, five new analyses are carried out with the data from the two studies to explore the suicide risk of adolescents who are homosexually oriented combined with those targeted for anti-gay harassment but not known to be homosexually oriented. Using more appropriate null hypothesis testing methods, as in the first analysis (Statistical Significance Tests), and more advanced statistical methods - Bayesian Analyses & Meta-Analysis (Table 1a), Arcsine Difference Meta-Analysis (Table 1b), Odds Ratio Meta-Analysis using two continuity correction methods (Table 1c, Table 1d), and the Peto Method, deemed the best available for studies with low counts and possible zero events (Table 1e) - it is shown that these sexual minority adolescents are most likely at greater risk for suicide. The five analyses are also carried out for the males alone in both studies, in which homosexually oriented males - combined with males reported to have been targeted for anti-gay harassment - accounted for all or almost all suicide deaths in the homosexuality-related category. Results for homosexuality-related males are generally more indicative of greater suicide risk than the results for homosexuality-related males and females analyzed together (Tables 2a, 2b, 2c, 2d, 2e).


Part 1: A Critical Examination of the Shaffer et al. (1995) & Renaud et al. (2010)
Psychological Autopsy Studies of Adolescent Suicides:
Are Sexual Minority Adolescents at Greater Risk for Suicide?

Introduction

The Shaffer et al. (1995) and Renaud et al. (2010) studies compared the proportion of homosexually oriented individuals in a group of adolescents who died by suicide (3 / 120: 2.50% & 4 / 55: 7.27%, respectively) to the proportion of homosexually oriented individuals in a living control group of adolescents (0 / 147: 0.0% & 0 / 55: 0.0%, respectively).

On the basis of the statistically nonsignificant result of the One-Sided Fisher Exact Test (p = 0.09) [two-sided test, p = 0.09], Shaffer et al. (1995) made the following conclusion: "In spite of opportunity for biased reporting, it is concluded that this study finds no evidence that suicide is a common characteristic of gay youth" (p. 64) and that "...the data here suggest that the painful experience of establishing a gay orientation does not lead disproportionally to suicide" (p. 71). For the same reason
(p = 0.06) [two-sided test, p = 0.118], Renaud et al. (2010) concluded, "In our sample, same-sex sexual orientation and gender identity issues do not appear to be more prevalent among youth who die by suicide, compared with youth recruited from the general population" (p. 29). These "no-difference" conclusions, however, must be questioned, because they involve wrong interpretations of nonsignificance, a problematic choice of statistical test, and a clear reversal of interpretation once the two studies are combined with meta-analytical and Bayesian statistics. Information related to both studies - counts, statistical significance results, and study details - is given in a table below.
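The reported p-values can be reconstructed directly from the counts. A minimal sketch in Python using scipy's Fisher exact test (the original analyses used other tools; this only reproduces the arithmetic):

```python
from scipy.stats import fisher_exact

# 2x2 tables: rows = suicide group, control group;
# columns = homosexually oriented, not homosexually oriented.
shaffer = [[3, 117], [0, 147]]  # Shaffer et al. (1995): 3/120 vs. 0/147
renaud = [[4, 51], [0, 55]]     # Renaud et al. (2010): 4/55 vs. 0/55

for name, table in [("Shaffer", shaffer), ("Renaud", renaud)]:
    _, p_one = fisher_exact(table, alternative="greater")
    _, p_two = fisher_exact(table, alternative="two-sided")
    print(f"{name}: one-sided p = {p_one:.3f}, two-sided p = {p_two:.3f}")
# Shaffer: one-sided p = 0.090, two-sided p = 0.090
# Renaud: one-sided p = 0.059, two-sided p = 0.118
```

Both studies' one- and two-sided values match those reported above.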



Absence of evidence is not evidence of absence

One major problem with the conclusions made in both studies is that statistical nonsignificance (p ≥ .05) has been interpreted as 'no difference' between the two compared groups, which also means accepting the Null Hypothesis (H0: "zero difference") and rejecting the Alternative Hypothesis (H1: "nonzero difference"). Most statistical textbooks inform users of statistics about this erroneous conclusion. Aberson (2002a)
highlights the problem in his abstract:
"Presenting results that “support” a null hypothesis requires more detailed statistical reporting than do results that reject the null hypothesis. Additionally, a change in thinking is required. Null hypothesis significance testing do not allow for conclusions about the likelihood that the null hypothesis is true, only whether it is unlikely that null is true."
For peer reviewers of scientific papers, Elsevier Publishers make available the same advice via a document produced by Tony Brady (2005/2008):
"Authors with 'negative' results (i.e. found no difference) should not report equivalence unless sufficiently proven - "absence of evidence is not evidence of absence.""
WikiAnswers effectively replied to the question "Does a hypothesis test ever prove the null hypothesis?":
"I could be mistaken on this, but research is not usually, if ever, designed to "prove" the null hypothesis. The idea, hope or expectation is that your design will give you sufficient reason to "reject" the null hypothesis. Under virtually all circumstances, it would be shabby work to take a study designed to do one thing and automatically conclude that undesired results "prove" some alternative, including the null hypothesis. The best you can do is to conclude that the null hypothesis cannot be rejected -- which is a far cry from proving it. Another take on this is that the null hypothesis is really an abstraction, and not of any practical use in and of itself."
Schlag (2011) offers important Null hypothesis testing information, with related warnings:
2.5 Accepting the Null Hypothesis: Null hypotheses are rejected or not rejected. One does not say that the null hypothesis is accepted as apparent in the conclusions made in the Shaffer & Renaud studies. Why? Not being able to reject the null hypothesis can have many reasons. It could be that the null hypothesis is true. It could be that the alternative hypothesis is true but that the test was not able to discover this and instead recommended not to reject the null [will be shown to apply in the Shaffer & Renaud studies]. The inability of the test to discover the truth can be due to the fact that it was not sufficiently powerful, that other tests are more powerful [these tests to be given later with results generated for the Shaffer & Renaud studies]. It could be that the sample size is not big enough so that no test can be sufficiently powerful [applies in the Shaffer & Renaud studies].
Gelman & Stern (2006) emphasize an important danger related to statistical testing:
"As well, introductory courses regularly warn students about the perils of strict adherence to a particular threshold such as the 5% significance level. Similarly, most statisticians and many practitioners are familiar with the notion that automatic use of a binary significant/ nonsignificant decision rule encourages practitioners to ignore potentially important observed differences."

  • The Shaffer and Renaud study data produced one-sided Fisher Exact Test p-values of 0.09 and 0.06, respectively, meaning that the criterion was met for not rejecting the Null Hypothesis (p >= 0.05), the p-value being defined as:
  • "The probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis is true." (Sestini & Rossi, 2009)
  • "[T]he probability of obtaining the difference observed, or one that is more extreme [Emphasis mine, see related excerpt, Hubbard & Lindsay, 2008], considering the null is true." (Biau et al., 2010)
  • A p < 0.05 would mean that the Null Hypothesis is rejected and this also "means there is only a small chance that the obtained result would have occurred if H0 [the Null Hypothesis] were in fact true" (Aberson, 2002a: 37).
  • Although p >= 0.05 is the criterion being met for not rejecting the Null Hypothesis, such p-values do not mean that the Null Hypothesis is true! Why? Because we began with this assumption: "The Null Hypothesis is true!" Therefore, the best that can happen after that, given that all calculations related to a study are based on the Null Hypothesis being true, is to produce evidence that the Null Hypothesis may be true, expressed as p-values being >= 0.05 (5%), meaning it should not be rejected.
  • Note that not rejecting the Null Hypothesis - therefore assuming that the differences in a study are not statistically significant - could be an error... a Type II Error that has long been reported to be common - also called "conservatism" - with the Fisher Exact Test (Hirji et al., 1991; Mehrotra et al., 2003; Hasselblad & Lokhnygina, 2007).
  • A p > 0.05 value means: "the probability is greater than 1 in 20 [5 in one hundred, 5%] that a difference this large or larger [Emphasis mine, see related excerpt, Hubbard & Lindsay, 2008] could occur by chance alone" (Goodman, 2008: 137), when assuming that the Null Hypothesis is true. Had the result been < 5% (0.05, the chosen α value), we could have said that the study result was unlikely to be due to chance alone, as based on the assumed Null Hypothesis. That is, the probability that "only the chance factor" applies would have been so low - less than 1 in 20 (< 5%, p < 0.05) - that the following conclusion would ensue: there is some inherent or real difference between the groups in the study, and the assumed Null Hypothesis is to be rejected. Given that the criterion for rejecting the Null Hypothesis is < 5% - quite demanding, especially for low count samples - the Null Hypothesis was not rejected in either study. As discussed above, the p > 0.05 results in both studies also do not mean that the Null Hypothesis is true, nor does such testing "allow for conclusions about the likelihood that the null hypothesis is true" (Aberson, 2002a: 36).
  • Nonetheless, the probability that the results would have occurred by "chance alone" - when in fact the Null Hypothesis was true - was still low in both studies: 9 / 100 (0.09, 9%: one- & two-sided) & 6 / 100 (0.06, 6%: one-sided) - 11.8 / 100 (0.118, 11.8%: two-sided). Even if the Null Hypothesis was not rejected, it was still possible that the suicide groups and controls were inherently different (also meaning a possible Type II Error: not rejecting the Null Hypothesis when it should have been rejected, a common problem with the Fisher Exact Test: Cashen & Geiger, 2004). Maybe then the differences - even if declared statistically nonsignificant - required a more critical evaluation. For example, the Renaud study's 6% result (p = 0.06) was maybe not necessarily the result of the magnitude of difference, but at least partly because p-values are more likely to be > 0.05 and in error when study samples are small, events are rare, and differences are small to moderate, or even large (Mehrotra et al., 2003). This issue will be discussed later.
  • For the Renaud study N's (4 / 55 vs. 0 / 55), it was also possible that the sample was large enough, but that using the Fisher Exact Test was inappropriate. That is, more appropriate - and more powerful - Null Hypothesis tests are maybe available for studies with low counts and when zero events occur. This will later be shown to apply in both the Shaffer and Renaud studies.
  • For more information about Null Hypothesis testing, p-values and related issues, see: Cook (2010), Goodman (1999), Hubbard & Armstrong (2006), Hubbard & Bayarri (2003, 2003a), Moran (2006), Moran & Solomon (2004), Panagiotakos (2008), Sellke et al. (2001), Senn (2001), Stang et al. (2010), Sterne (2002), and Verhagen et al. (2004).
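The Type II error / low power concern raised above can be illustrated by simulation. This is a sketch under assumed true prevalences - 7.3% among adolescents who die by suicide and 0.5% among living controls, with the Renaud sample sizes - where the prevalences are illustrative assumptions, not estimates from the studies:

```python
import numpy as np
from scipy.stats import fisher_exact

rng = np.random.default_rng(seed=1)
n = 55                              # Renaud sample size per group
p_suicide, p_control = 0.073, 0.005  # assumed true prevalences (illustrative)
n_sims, alpha = 2000, 0.05

rejections = 0
for _ in range(n_sims):
    a = rng.binomial(n, p_suicide)  # sexual minority count, suicide group
    c = rng.binomial(n, p_control)  # sexual minority count, control group
    _, p = fisher_exact([[a, n - a], [c, n - c]], alternative="greater")
    if p < alpha:
        rejections += 1

power = rejections / n_sims
print(f"Estimated power of the one-sided Fisher test: {power:.2f}")
# Well below the conventional 0.80 target: even a large "true" difference
# would usually go undetected at these sample sizes.
```

With samples this small, the simulated rejection rate falls far short of 80%, which is exactly the "waste of time and money" scenario Sedlmeier and Gigerenzer warn about.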
Were there issues related to the Shaffer et al. (1995) study that should have been explored? In the male and female adolescent suicide group (n = 120), three (n = 3) were classified as homosexual, while none (n = 0) were classified as homosexual in the control group (n = 147), as determined by selected informers. There is, of course, an observed difference between the two groups and, in such a situation, it might have been better to conclude that the control group was not large enough to produce at least one homosexual individual, with two being better for a determination of statistical significance and the magnitude of possible risk. With no homosexual individuals in the group of 147 controls, it would be somewhat reasonable to speculate that, had the control group been twice the size (n = 147 x 2 = 294), the tally for homosexual individuals might have remained at "0". In such a case, the one-sided Fisher Exact Test would have been p = 0.024, meaning that the Null Hypothesis would be rejected. For possible larger control groups in such a study, with no homosexual individuals in the control group (n = 0), rejecting the Null Hypothesis would have occurred beginning at n = 205 (and for all n > 205), with n = 205 producing a p-value of 0.0495. Had "one" homosexual individual been produced in the enlarged control group, however, let's say for n = 205, the p-value would be 0.114. Assuming, however, that one homosexual individual had been reported in a control sample twice the original size (n = 294), the p-value would be 0.075 (Taylor Series OR: 0.77<7.51<72.95), which is getting close to the < 0.05 value needed to reject the Null Hypothesis (also often called "statistical significance"). If we again double the size of the control group, to n = 588, with two homosexual individuals in the group, the result would have been p = 0.037.
In such a case, the Odds Ratio (OR, 95% CI) would have been 1.24<7.51<45.45 (Taylor Series OR), meaning that the odds for homosexual adolescents to die by suicide are 7.5 times greater (Risk Ratio = 7.35). Had there been three homosexual individuals in the n = 588 control group (n = 3), the OR would have been 0.997<5.03<25.08 (Taylor Series OR), p = 0.064. Such risk factors cannot be calculated when there are no homosexual individuals (n = 0) in the control group. (OR calculations carried out at: OpenEpi.) For the Shaffer study, had the control group been increased until one homosexual individual was in the group, the lowest possible OR produced would have been 0.39<3.77<36.7 (one-sided Fisher test: p = 0.237) if that homosexual individual had occurred as the 148th individual in the enlarged control group: an unlikely outcome. More likely, given that the first group of 147 control adolescents produced "0" homosexual individuals, is that the next 3 groups of 147 controls might each produce "0", "1", or "2" homosexual individuals.
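The 'what if the control group had been larger' reasoning can be checked directly. A minimal Python sketch, assuming the suicide-group counts stay at 3 / 120 and the enlarged control group keeps producing zero homosexual individuals:

```python
from scipy.stats import fisher_exact

# Shaffer et al. (1995): 3/120 in the suicide group. Grow a zero-event control
# group and find the smallest size at which the one-sided Fisher test rejects H0.
n_control = 147
while True:
    _, p = fisher_exact([[3, 117], [0, n_control]], alternative="greater")
    if p < 0.05:
        break
    n_control += 1

print(n_control, round(p, 4))  # 205 0.0495
```

This reproduces the n = 205 threshold and the p = 0.0495 value stated above.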

The Renaud et al. (2010) study essentially used the same methodology as the Shaffer et al. (1995) study, but the Renaud et al. researchers appear not to have carefully addressed the problem of sample size in determining the number of homosexual individuals that could be expected in both the adolescent suicide and control samples, nor did they carefully address the related statistical significance problems that could be expected when analyzing results from small samples. In their exploratory study there were only 55 adolescents who died by suicide and 55 matched control adolescents, which should have raised considerable concern that money might be wasted given the inconclusive results to be expected from such small samples. As noted by Sedlmeier and Gigerenzer (1989), "in the case of unknown and probably low power, a nonsignificant result signifies that no conclusion should be drawn, that is, that one should not affirm the null hypothesis with an uncontrolled error rate (beta error) and that the experiment probably was a waste of time and money" (p. 312). Nonetheless, their finding of 4 homosexually oriented individuals in the suicide group (4 / 55 = 7.3%) - about 3 times the proportion expected on the basis of Shaffer's results (3 / 120 = 2.5%) - proved fortuitous. However, they were left with the same likely expected problem in the control group: no homosexually oriented individuals as determined by selected informers, the same type of informers who determined that those who died by suicide were homosexually oriented. With their data, the one-sided Fisher Exact Test was p = 0.059, which is close to the < 0.05 p-value required to reject the Null Hypothesis.
If this study had had a slightly larger control group (n = 60, only five more individuals) with no homosexually oriented individuals (n = 0), it would have produced a one-sided Fisher Exact Test p-value of 0.049. Had the control sample been the same size as Shaffer's (n = 147), with no homosexual individuals, the resulting p-value would be 0.005. Had it been doubled (n = 110), with just one individual determined to be homosexually oriented, the Fisher Exact Test (one- or two-sided) would be p = 0.043, with the OR = 0.93<8.55<78.42 (Taylor Series OR). For the Renaud study, had the control group been increased until one homosexual individual was in the group, the lowest possible OR produced would have been 0.47<4.31<39.88 (Taylor Series OR; one-sided Fisher test: p = 0.176) if that homosexual individual had occurred as the 56th individual in the control group: an unlikely outcome. (OR calculations carried out at: OpenEpi.) More likely, given that the first group of 55 control adolescents produced "0" homosexual individuals, is that the next 3 groups of 55 controls might each produce "0", "1", or "2" homosexual individuals.
One problem in the two studies is that we have the minimum "sexual minority" counts needed in the two suicide groups to begin doing more revealing statistical work, but not the similar counts needed in the matched control groups, where the counts are "0". We know, however, that, given larger control groups, "sexual minority" counts would eventually appear, noting that these counts are determined by researcher-selected informers for both groups - a parent or close friend, for example, reporting that an adolescent in the living control group is/was a nonheterosexual / homosexually oriented adolescent, as was done for those who died by suicide. If the counts were low, this would mean using appropriate statistical methods (tests of statistical significance that would reject the null - the "no difference" - hypothesis) to show that a greater risk does exist. With more counts in both categories, however, and especially in categories that might have "0" counts, it would be possible to answer the most important question: the magnitude of the greater risk. This is given by Odds Ratios (ORs) - a direct measure of effect size - representing how much greater the risk is that sexual minority adolescents would die by suicide. For example, we might be able to state that the odds of sexual minority adolescents dying by suicide are 6 times greater than for heterosexual adolescents, with the 95% probability that identically repeated studies would produce OR values from maybe 2 to 12 (the confidence interval). The above exploration of counts in both studies loosely suggests that, had the researchers increased the size of the control samples, ORs of about 7 or 8 might have been produced with statistically significant or near-significant results, and that the absolute minimum OR would have been about 4, noting that this lowest OR would be unlikely.

In the following indented section, a vexing problem with the two studies may be made clearer by placing "you", the reader, in the position of someone who actually wanted to know whether sexual minority adolescents are at greater risk for dying by suicide. This 'exercise' also assumes that the two groups of researchers are one and the same.
“In 1995, I was anxiously awaiting the results from researchers I had paid to investigate if sexual minority adolescents are at greater risk for dying by suicide. When they finally reported their results, I was told that there were only 3 "sexual minority" deaths in the suicide group, and no "sexual minority" individuals in the control group (n = 147). They also informed me - on the basis of the one-sided Fisher Exact Test (p = 0.09) - that, even if it looked like there were differences between the two groups, the nonsignificant result of the test meant that there were no significant differences between the groups and that sexual minority adolescents are NOT at greater risk for dying by suicide. This result troubled me for years because, from 'somewhere' within me, I felt that there were differences: sexual minority adolescents in the suicide group and none in the control group.

About five years later (2000), I asked the same researchers to do a similar study because I wanted to know if sexual minority adolescents might have since become at greater risk for dying by suicide. When they reported their results to me in 2010, I simply could NOT believe what they had done. It was obvious to me that there had been a low count problem with the first study, but this time they had compared samples that were even smaller: 55 vs. 120 adolescent suicides, and 55 vs. 147 for the control groups, respectively. Fortunately, there was a greater percentage of "sexual minority" individuals in the suicide group (4 / 55 = 7.3%, compared to 3 / 120 = 2.5% in the first study), which would again make a statistical analysis possible, but there were still "0" sexual minority adolescents in the control group (n = 55), which certainly could have been expected given that the previous study had produced the same result (0 counts) in a sample about 3 times the size: 147 adolescents. Again, I was told what had been said about the first study: 'as the result of the Fisher Exact Test (p = 0.06) - even if it looks like there are differences between the two groups, and even if the results are very close to statistical significance - the nonsignificant result of the test means that there are no significant differences and that sexual minority adolescents are NOT at greater risk for dying by suicide.'

I felt uneasy about the reported research outcomes because there were always sexual minority adolescents in the suicide groups and none in the control groups:


The Shaffer et al. (1995) & Renaud et al. (2010) Studies:
Counts in the Study Samples

Category                           Shaffer et al. (1995)      Renaud et al. (2010)
                                   Suicide      Control       Suicide      Control
Adolescent Sample Sizes (N)        120          147           55           55
Sexual Minority Individuals (n)    3            0             4            0
% Sexual Minority                  2.5%         0.0%          7.3%         0.0%

Basic Observation: There are always sexual minority adolescents in the suicide group, and always none in the control group.

I even began to wonder if, for some reason, these researchers might have a biased agenda in maybe never wanting to report that "sexual minorities" are at greater risk for suicide. Yet, it seemed to me that they were at a greater suicide risk given the numbers produced in the two studies. I therefore decided to get second opinions and also to educate myself about statistical testing. I finally paid researchers well versed in sexual minority suicidology and statistics to re-evaluate the two studies. The great surprise was that the researchers who did the original two studies had been misusing statistical testing: it was inappropriate for them to declare that a nonsignificant statistical test result also meant that there was actually no difference between the two groups. Even worse, they did not mention anything about the many reports - common by the 1980s, but also existing before then - that the Fisher Exact Test often produces very conservative estimates when used with small independent binomial samples. This means that other, more appropriate statistical significance testing procedures might show that statistical significance exists for these studies: that the Null Hypothesis should be rejected. By using improved statistical methodologies, they showed that, in spite of the low counts in both studies, the new statistical significance results indicated that sexual minority adolescents were at greater risk for suicide.

It is now important that, to achieve an appropriately powered study sample, similarly conducted psychological autopsy studies will likely need a minimum of about 100 adolescents in the suicide group, with control group sizes selected so that there is an absolute minimum of about "one count" for sexual minority adolescents; however, a count of 3 might be a preferred minimum given estimate uncertainties when events range from 0 to 2. Meanwhile, we are left with the possibility that such new studies might produce Odds Ratios (ORs) from 5 to 8. This is possible given the results of 'playing', as done above, with what might be the sexual minority counts had the control groups in both studies been larger (producing ORs from about 5 to 8), or by adding a small amount to each cell (a continuity correction, such as 0.5) when there is a zero in one cell that makes the calculation of an odds ratio (OR) impossible. This procedure produces ORs of 8 to 10 (but with wide confidence intervals), as given below in the Mantel-Haenszel Meta-Analysis.
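The sizing suggestion above amounts to simple expected-count arithmetic. A sketch, assuming (illustratively) a 2.5% sexual minority prevalence in the control population:

```python
import math

def control_size_for_expected_count(k: int, prevalence: float) -> int:
    """Smallest control group size n with an expected count n * prevalence >= k."""
    return math.ceil(k / prevalence)

p = 0.025  # assumed prevalence (illustrative, loosely from Shaffer's 3/120)
print(control_size_for_expected_count(1, p))  # 40
print(control_size_for_expected_count(3, p))  # 120
```

That is, at a 2.5% prevalence, a control group of roughly 40 would be expected to yield one sexual minority individual, and roughly 120 would be needed for the preferred minimum of three.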


Combining the Shaffer et al. (1995) & Renaud et al. (2010) Studies via Meta-Analysis


The Shaffer et al. (1995) & Renaud et al. (2010) Studies:
Counts With Continuity Correction* & Related Meta-Analysis

Category              Shaffer et al.                         Renaud et al.
                      Suicides           Controls            Suicides          Controls
Homosexual: Yes       3 + 0.5 = 3.5      0 + 0.5 = 0.5       4 + 0.5 = 4.5     0 + 0.5 = 0.5
Homosexual: No        117 + 0.5 = 117.5  147 + 0.5 = 147.5   51 + 0.5 = 51.5   55 + 0.5 = 55.5
Odds Ratio (95% CI)   8.79 (0.45 - 171.81) [1]               9.70 (0.51 - 184.62) [1]
                      9.02 (0.46 - 176.34) [2]               9.70 (0.51 - 184.62) [2]

Two Studies Combined via Mantel-Haenszel Meta-Analysis:
OR: 9.37 (1.15 - 76.03) - Fixed effect. [2]
OR: 9.36 (1.15 - 75.85) - Random effects. [2]

1. OR Calculations (Taylor Series): DB Wilson. OpenEpi.
2. ORs & Meta-analyses carried out with the R program: Schwarzer (2012). Reference: Fixed / Random Effect(s) Models.
* Mantel-Haenszel odds ratio method: adding 0.5 to each cell when one cell is zero. This is not a good method to use with rare events study results according to The Cochrane Collaboration (2011), but only one part of each 2x2 table is in the rare event category (0.0%). Nonetheless, the OR meta-analysis result - 9.37 - is close to the OR range (5 to 8) estimated above and is therefore reasonable.
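The individual-study odds ratios and Taylor-series confidence intervals in the table can be reproduced from the continuity-corrected counts. A sketch in Python (the Mantel-Haenszel pooling itself was done with the R meta package and is not reproduced here):

```python
import math

def cc_odds_ratio(a, b, c, d, cc=0.5):
    """Odds ratio after adding a continuity correction to every cell,
    with a 95% Taylor-series (Woolf) confidence interval.
    a = exposed cases, b = unexposed cases, c = exposed controls, d = unexposed controls."""
    a, b, c, d = a + cc, b + cc, c + cc, d + cc
    or_ = (a * d) / (b * c)
    se = math.sqrt(1/a + 1/b + 1/c + 1/d)  # SE of log(OR)
    lo = math.exp(math.log(or_) - 1.96 * se)
    hi = math.exp(math.log(or_) + 1.96 * se)
    return or_, lo, hi

print(cc_odds_ratio(3, 117, 0, 147))  # Shaffer: ~8.79 (0.45 - 171.8)
print(cc_odds_ratio(4, 51, 0, 55))    # Renaud: ~9.70 (0.51 - 184.6)
```

The 0.5 correction makes the "0" cells usable at the cost of wide confidence intervals, matching the per-study values in the table.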


The Peto Method Odds Ratio calculations below provide an estimate of the odds ratio. Related individual study OR results and the results for the two studies combined via meta-analysis are close to the OR range (5 to 8) estimates given above, all being statistically significant or near-significant.


The Shaffer et al. (1995) & Renaud et al. (2010) Studies:
Peto Method Odds Ratios* & Related Meta-Analysis

Category                   Shaffer et al.             Renaud et al.
                           Suicides    Controls       Suicides    Controls
Homosexual: Yes            3           0              4           0
Homosexual: No             117         147            51          55
Peto Odds Ratio (95% CI)   9.41 (0.96 - 92.33) [1]    7.82 (1.07 - 57.06) [1]
                           9.72 (0.99 - 95.62) [2]    7.82 (1.07 - 57.06) [2]

Two Studies Combined via Meta-Analysis:
8.47 (1.89 - 37.92) [1]
OR: 8.59 (1.92 - 38.48) - Fixed effect. [2]
OR: 8.59 (1.92 - 38.48) - Random effects. [2]

1. Peto OR Calculator: DJR Hutchon. Meta-Analysis Calculator: DJR Hutchon.
2. ORs & Meta-analyses carried out with the R program: Schwarzer (2012). Reference: Fixed / Random Effect(s) Models.
* This may be the best method to use with rare events study results according to The Cochrane Collaboration (2011). However, the Peto OR method is recommended for incidences less than 1%, which applies for the 0% cells, but the other incidences are above 1% (3 / 120 = 2.5% & 4 / 55 = 7.3%), which would result in underestimates of the OR (see comment below the table). The OR meta-analysis result - 8.59 - is nonetheless close to the OR range (5 to 8) estimated above and is therefore reasonable.

Peto Odds Ratio Method: Bradburn et al. (2007) describe how the Odds Ratios are estimated using the Peto method: "The Peto one-step method [16: Yusuf et al, 1985] computes an approximation of the log-odds from the ratio of the efficient score to the Fisher information, both evaluated under the null hypothesis. These quantities are estimated, respectively, by the sum of the differences between the observed and expected numbers of events in the treatment arm and by the sum of the conditional hypergeometric variances." (p. 55) The method works well with low incidences (less than 1%), including zero events (n = 0), which are present in both the Shaffer and Renaud studies. However, it produces increasingly biased results (OR underestimates) when incidences are greater than 1%. For the Shaffer study, the underestimate would apply given the 2.50% incidence (3 / 120), with the same applying for the Renaud study: 7.27% (4 / 55).
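The Peto one-step computation described above is simple enough to reproduce directly. The sketch below (helper name ours; the 95% interval assumes normality on the log-odds scale) recovers the per-study values shown in the table, about 9.41 and 7.82:

```python
from math import exp, sqrt

def peto_or(events_a, n_a, events_b, n_b):
    """Peto one-step odds ratio for a 2x2 table.
    Group A = suicides, group B = living controls;
    'events' = homosexually oriented individuals."""
    N = n_a + n_b
    m1 = events_a + events_b            # total events
    m2 = N - m1                         # total non-events
    expected = n_a * m1 / N             # E: expected events in group A
    o_minus_e = events_a - expected     # O - E
    # Conditional hypergeometric variance of the group-A event count
    v = n_a * n_b * m1 * m2 / (N ** 2 * (N - 1))
    log_or = o_minus_e / v
    se = 1 / sqrt(v)
    ci = (exp(log_or - 1.96 * se), exp(log_or + 1.96 * se))
    return exp(log_or), ci

# Shaffer et al. (1995): 3/120 suicides vs 0/147 controls
or_shaffer, ci_shaffer = peto_or(3, 120, 0, 147)   # ~9.41 (0.96 - 92.33)
# Renaud et al. (2010): 4/55 suicides vs 0/55 controls
or_renaud, ci_renaud = peto_or(4, 55, 0, 55)       # ~7.82 (1.07 - 57.06)
```

Note that the zero cells need no correction here: O - E and the hypergeometric variance remain defined when one cell is zero, which is why the Peto method handles rare events gracefully.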

The Peto OR results for each study - 9.72 (0.99 - 95.62) & 7.82 (1.07 - 57.06) - combined with the meta-analysis results, are reasonably good indicators that each study's results were statistically significant or very nearly so. This outcome is also suggested by the statistically significant result of the two studies combined via meta-analysis: 8.59 (1.92 - 38.48). For more information on combining studies via meta-analytic procedures, see Cohn & Becker (2003), who summarized their findings in the paper's abstract:
Abstract: One of the most frequently cited reasons for conducting a meta-analysis is the increase in statistical power that it affords a reviewer. This article demonstrates that fixed-effects meta-analysis increases statistical power by reducing the standard error of the weighted average effect size (T̄.) and, in so doing, shrinks the confidence interval around T̄.. Small confidence intervals make it more likely for reviewers to detect nonzero population effects, thereby increasing statistical power. Smaller confidence intervals also represent increased precision of the estimated population effect size. Computational examples are provided for 3 effect-size indices: d (standardized mean difference), Pearson's r , and odds ratios. Random-effects meta-analyses also may show increased statistical power and a smaller standard error of the weighted average effect size. However, the authors demonstrate that increasing the number of studies in a random-effects meta-analysis does not always increase statistical power.
After presenting related issues, Cohn & Becker (2003) outlined the reason for their paper:
"Although several articles have addressed issues related to meta-analysis and statistical power (e.g., Hedges & Pigott, 2001; Hunter & Schmidt, 1990; Strube, 1985; Strube & Miller, 1986), no article has explained how meta-analysis increases the statistical power of tests of overall treatment effects and relationships. This article addresses this gap in explanation." (p. 246)
For meta-analysis criticisms and related evaluations, see Borenstein et al. (2009). Note that these criticisms do not apply to the meta-analysis results reported on this webpage: the two studies combined via meta-analysis used almost identical methodologies, and the purpose of the meta-analyses is only to report the results for the two studies combined. The meta-analytic combinations of the two studies also consistently resulted in statistically significant differences.
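The fixed-effect combination of the two studies follows the same Peto arithmetic: the per-study O - E and variance terms are summed before forming the pooled log odds ratio. A minimal sketch (helper name ours; the 95% interval again assumes normality on the log scale) reproduces the 8.47 (1.89 - 37.92) reported in the table above:

```python
from math import exp, sqrt

def o_e_and_v(events_a, n_a, events_b, n_b):
    """O - E and conditional hypergeometric variance V for one 2x2 table."""
    N = n_a + n_b
    m1 = events_a + events_b
    o_minus_e = events_a - n_a * m1 / N
    v = n_a * n_b * m1 * (N - m1) / (N ** 2 * (N - 1))
    return o_minus_e, v

studies = [(3, 120, 0, 147),   # Shaffer et al. (1995)
           (4, 55, 0, 55)]     # Renaud et al. (2010)

stats = [o_e_and_v(*s) for s in studies]
sum_oe = sum(oe for oe, _ in stats)   # pooled O - E
sum_v = sum(v for _, v in stats)      # pooled variance (= study weights)

log_or = sum_oe / sum_v               # pooled Peto log odds ratio
se = 1 / sqrt(sum_v)
pooled = exp(log_or)                                        # ~8.47
ci = (exp(log_or - 1.96 * se), exp(log_or + 1.96 * se))     # ~(1.89, 37.92)
```

Because each study's weight is its hypergeometric variance, the larger Renaud table (2 expected events) contributes slightly more to the pooled estimate than the Shaffer table (about 1.35 expected events).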


References

Aberson, Chris (2002a). Interpreting Null Results: Improving Presentation and Conclusions with Confidence intervals. JASNH (Journal of Articles in Support of the Null Hypothesis), 1(3): 36-42.  PDF Download.

Biau DJ, Jolles BM, Porcher R (2010). P value and the theory of hypothesis testing: an explanation for new researchers. Clinical Orthopaedics and Related Research, 468(3): 885-92. Abstract. PDF Download.

Borenstein M, Hedges LV, Higgins JPT, Rothstein HR (2009). Criticisms of Meta-Analysis. In: Introduction to Meta-Analysis, John Wiley & Sons, Ltd, Chichester, UK. doi: 10.1002/9780470743386.ch43. Amazon. Book Contents. Reference and Cited Chapter Sections. Full Text: Chapter 43.

Brady, Tony (2005/2008). Reviewer's quick guide to common statistical errors in scientific papers [Elsevier advice to peer reviewers of scientific papers].
PDF Download. Full Text.

The Cochrane Collaboration (2011). Cochrane Handbook for Systematic Reviews of Interventions - Version 5.1.0. PDF Download.
From Section "16.9.5  Validity of methods of meta-analysis for rare events" (PDF Download):

"Bradburn et al. found that many of the most commonly used meta-analytical methods were biased when events were rare (Bradburn 2007).  The bias was greatest in inverse variance and DerSimonian and Laird odds ratio and risk difference methods, and the Mantel-Haenszel odds ratio method using a 0.5 zero-cell correction.  As already noted, risk difference meta-analytical methods tended to show conservative confidence interval coverage and low statistical power when risks of events were low.

At event rates below 1% the Peto one-step odds ratio method was found to be the least biased and most powerful method, and provided the best confidence interval coverage, provided there was no substantial imbalance between treatment and control group sizes within studies, and treatment effects were not exceptionally large. This finding was consistently observed across three different meta-analytical scenarios, and was also observed by Sweeting et al. (Sweeting 2004)...

Methods that should be avoided with rare events are the inverse-variance methods (including the DerSimonian and Laird random-effects method). These directly incorporate the study’s variance in the estimation of its contribution to the meta-analysis, but these are usually based on a large-sample variance approximation, which was not intended for use with rare events. The DerSimonian and Laird method is the only random-effects method commonly available in meta-analytic software.  We would suggest that incorporation of heterogeneity into an estimate of a treatment effect should be a secondary consideration when attempting to produce estimates of effects from sparse data – the primary concern is to discern whether there is any signal of an effect in the data."

From Section "9.4.4.2  Peto odds ratio method" (Full Text):

Peto’s method (Yusuf 1985) can only be used to pool odds ratios. It uses an inverse variance approach but utilizes an approximate method of estimating the log odds ratio, and uses different weights. An alternative way of viewing the Peto method is as a sum of ‘O – E’ statistics. Here, O is the observed number of events and E is an expected number of events in the experimental intervention group of each study. The approximation used in the computation of the log odds ratio works well when intervention effects are small (odds ratios are close to one), events are not particularly common and the studies have similar numbers in experimental and control groups. In other situations it has been shown to give biased answers. As these criteria are not always fulfilled, Peto’s method is not recommended as a default approach for meta-analysis.  Corrections for zero cell counts are not necessary when using Peto’s method. Perhaps for this reason, this method performs well when events are very rare (Bradburn 2007) (see Chapter 16, Section 16.9).

Cochrane Collaboration’s Open Learning Material for Cochrane reviewers
- From Section "Combining studies: Weighted Averages" (PDF Download):

The Peto method: The Peto method works for odds ratios only. Focus is placed on the observed number of events in the experimental intervention. We call this O for 'observed' number of events, and compare this with E, the 'expected' number of events. Hence an alternative name for this method is the 'O - E' method. The expected number is calculated using the overall event rate in both the experimental and control groups. Because of the way the Peto method calculates odds ratios, it is appropriate when trials have roughly equal number of participants in each group and treatment effects are small. Indeed, it was developed for use in mega-trials in cancer and heart disease where small effects are likely, yet very important. The Peto method is better than the other approaches at estimating odds ratios when there are lots of trials with no events in one or both arms. It is the best method to use with rare outcomes of this type. The Peto method is generally less useful in Cochrane reviews, where trials are often small and some treatment effects may be large.

Cohn LD, Becker BJ (2003). How meta-analysis increases statistical power. Psychological Methods, 8(3): 243-53. Abstract. PDF Download.

Cook C (2010). Five per cent of the time it works 100 per cent of the time: the erroneousness of the P value. The Journal of Manual and Manipulative Therapy, 18(3): 123-5.
PDF Download.

Gelman A, Stern H (2006). The Difference Between “Significant” and “Not Significant” is not Itself Statistically Significant. The American Statistician, 60(4): 328-331. PDF Download.

Goodman, Steven (2008). A Dirty Dozen: Twelve P-Value Misconceptions. Seminars in Hematology, 45: 135-140. 
PDF Download.

Goodman SN (1999). Toward evidence-based medical statistics. 1: The P Value Fallacy. Annals of Internal Medicine, 130: 995-1004. Abstract. PDF Download.

Hubbard R, Armstrong JS (2006). Why We Don't Really Know What Statistical Significance Means: Implications for Educators. Journal of Marketing Education, 28(2): 114-120. Abstract. PDF Download. PDF Download.

Hubbard R, Bayarri MJ (2003). Confusion over measures of evidence (p’s) versus errors (α’s) in classical statistical testing. The American Statistician, 57: 171–182. (With discussion). Abstract.

Hubbard R, Bayarri MJ (2003a). P Values are not Error Probabilities. Working Paper 03-26 ISDS, ISDS, Duke University. PDF Download.

Hubbard R, Lindsay RM (2008). Why P Values Are Not a Useful Measure of Evidence in Statistical Significance Testing. Theory & Psychology, 18(1): 69–88. Abstract. PDF Download.
Excerpt: "The p value computes not the probability of the observed data under H0, but this plus the probability of more extreme data. This is a major weakness regarding the usefulness of p values. Because they are defined as a procedure for establishing the probability of an outcome, as well as more extreme ones, on a null hypothesis, significance tests are affected by how the probability distribution is spread over unobserved outcomes in the sample space. That is, the p value denotes not only the probability of what was observed, but also the probabilities of all the more extreme events that did not arise. How is it that these more extreme, unobserved, cases are involved in calculating the p value?" (p. 78)

Moran JL (2006). Statistical Issues in the Analysis of Outcomes in Critical Care Medicine. Dissertation for the Degree of Doctor of Medicine, Department of Intensive Care Medicine, University of Adelaide, Australia. Abstract & Download Page. PDF Download.

Moran JL, Solomon PJ (2004). Point of View: A Farewell to P-values? Critical Care and Resuscitation, 6: 130-137. PDF Download. Read online.

O'Brien, Kathleen (2010). A debate on the issue of suicide among gay youth. Full Text.

Panagiotakos, Demosthenes B (2008). The Value of p-Value in Biomedical Research. The Open Cardiovascular Medicine Journal, 2: 97-99. PDF Download. PDF Download.

Schlag, Karl H (2011). Exact Hypothesis Testing without Assumptions - New and Old Results not only for Experimental Game Theory. Conference presentation. PDF Download. PDF Download: 2010 Version.

Sellke T, Bayarri MJ, Berger JO (2001). Calibration of p Values for Testing Precise Null Hypotheses. The American Statistician, 55(1): 62-71.
Abstract. PDF Download.

Senn S (2001). Two cheers for P-values? Journal of Epidemiology and Biostatistics, 6(2): 193-204; discussion 205-10. PDF Download. Abstract: P-values are a practical success but a critical failure. Scientists the world over use them, but scarcely a statistician can be found to defend them. Bayesians in particular find them ridiculous, but even the modern frequentist has little time for them. In this essay, I consider what, if anything, might be said in their favour.

Sestini P, Rossi S (2009). Exposing the P value fallacy to young residents. Presented at the 5th International Conference of Evidence-Based Health Care Teachers & Developers, Taormina, Italy, October 29, 2009.
PDF Download.

Stang A, Poole C, Kuss O (2010). The ongoing tyranny of statistical significance testing in biomedical research. European Journal of Epidemiology, 25(4): 225-30. Abstract. PDF Download.

Sterne JA (2002). Teaching hypothesis tests--time for significant change? Statistics in Medicine, 21(7): 985-94. Abstract. PDF Download.

Verhagen AP, Ostelo RWJG, Rademaker A (2004). Is the p value really so significant? Australian Journal of Physiotherapy, 50: 261- 262. PDF Download. 


"Statisticians are not to blame for the misconceptions in psychology about the use of statistical methods. They have warned us about the use of the hypothesis-testing models and the related concepts. In particular they have criticized the null hypothesis model and have recommended alternative procedures similar to those recommended here (See Savage, 1957; Tukey, 1954; and Yates, 1951)." (Nunnally, 1960: 649)

Nunnally, Jum (1960). The Place of Statistics in Psychology. Educational and Psychological Measurement, 20(4): 641-650. Abstract.

50 Years Later:

"Recommendation 2.17: Researchers must continue to move beyond a sole reliance on statistical significance in interpreting quantitative research in suicidology to address issues of the clinical and practical usefulness of their results." (Rogers & Lester, 2010: Chapter 2: General Methodological Issues, p. 22)

Rogers JR, Lester D (2010). Understanding Suicide: Why We Don't and How We Might. Cambridge, MA: Hogrefe Publishing. Hogrefe Publishing. Amazon. Book Review.



Power Analyses: Absent From the Shaffer et al. (1995) & Renaud et al. (2010) Studies

A good summary of "Statistical Power" is given in Wikipedia:
"The power of a statistical test is the probability that the test will reject the null hypothesis when the null hypothesis is actually false (i.e. the probability of not committing a Type II error, or making a false negative decision [as it will be shown to apply in both the Shaffer et al. (1995) and Renaud et al. (2010) studies]). The power is in general a function of the possible distributions, often determined by a parameter, under the alternative hypothesis. As the power increases, the chances of a Type II error occurring decrease. The probability of a Type II error occurring is referred to as the false negative rate (β). Therefore power is equal to 1 − β, which is also known as the sensitivity.
Power analysis can be used to calculate the minimum sample size required so that one can be reasonably likely to detect an effect of a given size. Power analysis can also be used to calculate the minimum effect size that is likely to be detected in a study using a given sample size. In addition, the concept of power is used to make comparisons between different statistical testing procedures: for example, between a parametric and a nonparametric test of the same hypothesis."
Given that power analyses were generally lacking in published medical research papers that required such analyses, Friedman et al. (2001) emphasized what was required by ending their paper with:
"A proper design of the study and appropriate statistical analysis are essential to the validity of all quantitative clinical research. A type-I error is better known than a type-II error and reviewers and readers are more cognisant of p values when authors conclude that significant differences between groups are found. Equal scrutiny is required when authors decide that there is no statistically significant difference.
In this age of limited resources and tight budgets, physicians may be forced to employ the cheapest methods, especially if the choices are thought to be similar. It is therefore important that investigators do not erroneously label two treatments as equivalent, when it has merely been shown that the differences were not statistically significant. All clinical studies should be based on appropriate calculations of sample size. The awareness of type-I error and the popularity of p values should be matched by equal cognisance of type-II error and β values. The practice of evidence-based medicine requires no less." (p. 401)
Unfortunately, power analyses have been generally ignored by researchers in spite of repeated recommendations in many fields of study (Sedlmeier & Gigerenzer, 1989; Friedman et al., 2001; Jennions & Møller, 2003; Cashen & Geiger, 2004; Balkin & Sheperis, 2011; Lau & Kuk, 2011), essentially meaning that ignorance of a basic statistical concept reigns supreme in the research world, for both researchers and the numerous peer reviewers evaluating papers submitted for publication. This ignorance especially applies with respect to Null Hypothesis Testing methods (statistical significance testing), and above all to small samples and samples with rare events, including zero counts in 2x2 cells. This issue will be addressed later on this page.
  • Missing from both the Shaffer and Renaud studies is a power analysis, which should be carried out before a study is proposed, as noted by Lau & Kuk (2011) in "Enough is enough: A primer on power analysis in study designs":
"For researchers, knowing the optimal number of subjects to recruit to a study will provide more certainty in the conclusion without wasting unnecessary resources (such as time and money). For clinicians who review the research, knowing that the study was performed with an adequate number of subjects could instill additional confidence in the reported findings. Thus, it is important that both researchers and consumers of the research understand whether the study was conducted with an adequate sample of subjects. One way to choose an appropriate sample is to use power analysis." (p. 30)
"Power analysis can be done after the data has been collected in order to report the chance (P [= Power]) of detecting a true effect. This is known as a post-hoc power analysis. The American Psychological Association recommends that researchers include a post-hoc power analysis as a good practice when non-significant results are reported in the paper. This information can be used to improve the design of future independent studies that replicate the nonsignificant study. Unfortunately, it is uncommon to find post-hoc power analyses in publications." (p. 30)
"Another useful application of power analysis is to determine the minimal number of subjects in a research design prior to data collection. This is known as a priori power analysis. An increasing number of journal reviewers are asking for a priori power analyses to justify the sample size in a submitted paper, especially for those studies with a relatively small pool of subjects.
Post-hoc power analysis to determine the impact of treatment: When applying power analysis, four essential, interrelated parameters are required: sample size (N), effect size (ES) [Greatly ignored, see: Cohen (1992) & Alhija (2009)], significance criterion (α), and power (P). The sample size N denotes the number of subjects who take part in the experiment. The effect size is also known as the sensitivity of the test. ES can be conventionally expressed as different indices (with different notations) according to the statistical tests being employed. For example, the effect size index is d for a t-test with two independent groups...
The calculation of p-value can be referenced to any standard textbook on statistics. The significance criterion (α) in power analysis is the chance level for an error the researcher is willing to accept when the test shows a significant treatment effect [significant group difference], i.e., p < α. When α is set to 0.05, it means there is a 5% chance of making an error of believing that there is a true effect in the population when there is none.
In a post-hoc power analysis, the parameters N, ES, and α are typically known, and the calculation of P is of the most interest to the researchers and its intended audience. If P is found to be greater than 0.8 [80%], it is assumed that the treatment effect is considered practically significant and has an impact in the real world. If P is less than 0.8 [80%], the value of ES could help in estimating how many subjects will be needed to yield a P greater than 0.8. The post-hoc power analysis becomes an a priori power analysis." (pp. 30-31)
Aberson et al. (2002) summarized the Statistical Power misunderstandings and misconceptions in "An Interactive Tutorial for Teaching Statistical Power":
"Statistical power considerations are important to adequate research design (Cohen 1988). Without sufficient statistical power, data-based conclusions may be useless. Students and researchers often misunderstand factors relating to statistical power. One common misunderstanding is the relationship between Type II errors and Type I errors. Given a design with a 5% Type I error rate, students and researchers often predict the rate of Type II errors also to be 5% (Hunter 1997; Nickerson 2000). Of course, the probability of a Type II error is generally much greater than 5%, and in a given study, the probability of a Type II error is inversely related to the Type I error rate. Another misconception is the belief that failure to reject the null hypothesis is sufficient evidence that the null hypothesis is true (that is, failing to reject suggests that the null hypothesis is true; see Nickerson 2000 [this being "the misconception" dominating in the Shaffer and Renaud studies]). The prevalence of underpowered studies in many fields is striking evidence of a lack of comprehension of the relevance of statistical power to research design (for example, on average, the Type II error rate in psychology and education is estimated to be 50% or more; Sedlmeier and Gigerenzer 1989; Lipsey and Wilson 1993)."

Post Hoc Power Analysis (Fisher exact test) - Power = 0.35 (one-sided or two-sided) - meaning a high probability (0.65) of a Type II Error (compared to the generally acceptable probability of 0.20). A Type II error occurs when the two groups truly differ, but the study lacks the power needed to detect the difference. [Calculator: Power (one-sided or two-sided Fisher exact test) = 0.3527. Same result with G*Power Program]
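Because the control group contains zero homosexually oriented adolescents, the power of the Fisher exact test can be computed exactly: only the suicide-group count varies, so power is the binomial probability of reaching the smallest count the test would call significant. A stdlib-only sketch (helper names ours; one-sided test shown, and the page reports the same 0.35 for the two-sided test):

```python
from math import comb

def fisher_p_one_sided(x, n1, n2):
    """One-sided Fisher p-value for the table [[x, n1 - x], [0, n2]]:
    with all x events in group 1, the observed table is the most
    extreme one, so p = C(n1, x) / C(n1 + n2, x)."""
    return comb(n1, x) / comb(n1 + n2, x)

def fisher_power(n1, p1, n2, alpha=0.05):
    """Exact power when the control event probability is 0: find the
    smallest group-1 count that rejects, then sum the binomial tail."""
    x_crit = next(x for x in range(1, n1 + 1)
                  if fisher_p_one_sided(x, n1, n2) < alpha)
    return sum(comb(n1, k) * p1 ** k * (1 - p1) ** (n1 - k)
               for k in range(x_crit, n1 + 1))

# Shaffer et al.: p1 = 3/120 = 0.025, 120 suicides vs 147 controls.
power_shaffer = fisher_power(120, 3 / 120, 147)   # ~0.3527
```

The enumeration shows that rejection requires at least 4 homosexually oriented suicides, so the power is P(X >= 4) for X ~ Binomial(120, 0.025): about 0.35, matching the calculator value above.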

In their paper, the researchers could then have reported that the probability of a Type II error was high (0.65): that a real difference between the groups could easily have gone undetected because the power was low (0.35). Furthermore, for future researchers seeking to replicate the study, they could have given the following sample size analysis, based on their study results and the need for a power of 0.80.
  • Fisher-Test, p1 = .025 [3 / 120], p2 = 0.000 [0 / 147], alpha = 0.05, power = 0.80 (Acceptable Level):
  • One-sided Fisher Exact Test: n1 = n2 = 339 = Sample Size Required   [G*Power: n1 = n2 = 339] 
  • Two-sided Fisher Exact Test: n1 = n2 = 445 = Sample Size Required   [G*Power: n1 = n2 = 445]
The above information would also have precluded the study authors from making their monumental interpretation error: their statistically non-significant p-value (p = 0.90, one-sided or two-sided Fisher exact test) means only that the Null Hypothesis is not rejected (a fact), yet it was interpreted to mean that there is no difference between the two groups (a likely falsehood). Such a conclusion would only be plausible - but still not certain - if the study had a power of 0.80.

Whereas Shaffer et al. could argue that they could not know the effect size before carrying out their study, and thus were not able to do an a priori power analysis, this defense does not apply to Renaud et al. The a priori power analysis for the Renaud study would have been the post hoc power analysis, including the sample size recommendation (detailed above), that the Shaffer et al. researchers should have included in their paper. Given that this information was not made available, it should have been generated and reported by the Renaud et al. researchers. It would then have been important to explain why, given the facts of the case, they chose to have only 55 individuals in both the suicide group and the control group. At that point they would have had to generate another analysis, based on the Shaffer et al. results, reporting the expected power of the study they were about to carry out.
  • Fisher Exact Test, p1 = .025, p2 = 0.000, alpha = 0.05, n1 = n2 = 55 - Power = 0.0025 (two-sided), 0.0121 (one-sided). Therefore, the proposed study only had a 1.2% chance of detecting that the two groups might differ using the one-sided Fisher exact test. [Calculator]
This information would have revealed - given that the homosexuality-related questions to be used with informants were taken from the Shaffer et al. study - that the results of the Shaffer et al. study were known, including the counts (3 / 120 vs 0 / 147) and the related statistical non-significance of the analysis (p = 0.90). This would have been embarrassing to acknowledge, as would these questions posed to the researchers:
  • Given that Shaffer et al. only had 3 homosexually oriented individuals in a sample of 120 adolescent suicides, how many could you expect from your sample of 55 adolescent suicides? One? Maybe two at best?
  • Given that Shaffer et al. had zero homosexually oriented individuals in their control sample of 147 adolescents, was not using only 55 adolescents in your control sample close to a guarantee that there would be zero homosexually oriented adolescents in the control group?
  • Nonetheless, the study was carried out, likely with the knowledge that the counts of homosexually oriented adolescents would be so low as to preclude publishable results. As could have been expected, there were zero homosexually oriented adolescents in the control group. However, likely quite unexpectedly, there were 4 homosexually oriented individuals in the adolescent suicide group, or 7.27% (4 / 55), compared to 2.5% (3 / 120) in the Shaffer et al. study (a threefold increase). Moreover, the statistical analysis produced Fisher exact test values of p = 0.118 (two-sided) and p = 0.059 (one-sided), indicating that the p-value was now much closer to the value required to reject the Null Hypothesis: p < 0.05. Nonetheless, Renaud et al. replicated the Shaffer et al. conclusion and therefore made the same error. That is, statistical non-significance - meaning only that the Null Hypothesis is not rejected - was again assumed to mean that there was no difference between the groups with respect to the presence of homosexually oriented adolescents. They also failed to give a post hoc power analysis at the end of their paper. The result of such an analysis would have been: for a two-sided Fisher exact test, their study only had a power of 0.21 for detecting a difference between the two groups, with the power still being very low - 0.37 - for a one-sided Fisher exact test.

Post Hoc Power Analysis (Fisher exact test) - Power = 0.37 (one-sided) and Power = 0.21 (two-sided) [same results with the G*Power Program] - meaning a high probability (0.64) of a Type II Error (compared to the generally acceptable probability of 0.20). [Calculator: Power, Fisher exact test, one-sided (0.3708) or two-sided (0.2085)]
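The same exact enumeration applies to the Renaud et al. design. Because both groups have n = 55, the two extreme tables (all events in one group or the other) are equally probable, so the two-sided p-value is simply twice the one-sided one. A stdlib-only sketch (helper names ours) of both post hoc power values, under these assumptions:

```python
from math import comb

def p_one_sided(x, n):
    # One-sided Fisher p for [[x, n - x], [0, n]] with equal group sizes
    return comb(n, x) / comb(2 * n, x)

def power(n, p1, alpha=0.05, two_sided=False):
    """Exact power with a zero-event control group of the same size n;
    the two-sided p doubles the one-sided p (equiprobable extremes)."""
    factor = 2 if two_sided else 1
    x_crit = next(x for x in range(1, n + 1)
                  if factor * p_one_sided(x, n) < alpha)
    return sum(comb(n, k) * p1 ** k * (1 - p1) ** (n - k)
               for k in range(x_crit, n + 1))

p1 = 4 / 55   # 7.27% observed in the Renaud et al. suicide group
power_one = power(55, p1)                  # ~0.37 (rejects at x >= 5)
power_two = power(55, p1, two_sided=True)  # ~0.21 (rejects at x >= 6)
```

The enumeration agrees with the calculator values quoted above to within rounding: about 0.37 one-sided and 0.21 two-sided.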

For future researchers seeking to replicate the study, they could then have given the following sample size analysis, based on their study results and the need for a power of 0.80.
  • Fisher-Test, p1 = .0727 [4 / 55], p2 = 0.000 [0 / 55], alpha = 0.05, power = 0.80 (Acceptable Level):
  • One-sided Fisher Exact Test: n1 = n2 = 111 = Sample Size Required   [G*Power: n1 = n2 = 110] 
  • Two-sided Fisher Exact Test: n1 = n2 = 147 = Sample Size Required  [G*Power: n1 = n2 = 144]
The above would also have precluded the study authors from making their monumental interpretation error: their statistically non-significant p-value (p = 0.059, one-sided Fisher exact test) means only that the Null Hypothesis is not rejected (a fact), yet it was interpreted to mean that there is no difference between the two groups (a likely falsehood). Such a conclusion would only be plausible - but still not certain - if the study had a power of 0.80.
However, given the Shaffer et al. results, the Renaud et al. researchers could have recommended that future researchers use higher n's, to increase the probability that such a future study would produce counts yielding a power close to 0.80. These numbers could have been about 200 for adolescent suicides and about 300 for the control sample.

The percentages of homosexually oriented adolescents in the two suicide groups could be averaged ([2.5% + 7.3% = 9.8%] / 2 = 4.9%), with this approximately 5% proportion, and the 0% in the control groups, then used for a sample size estimate at a power of 0.80.

Fisher-Test (One-Sided), p1 = .05, p2 = 0.000, alpha = 0.05, power = 0.80 (Acceptable Level), n2 = twice the size of n1:
Sample Size Required: "Assuming outcome data will be analyzed prospectively by Fisher's exact-test or with a continuity corrected chi-squared test and that all observations are independent": n1 = 128, n2 = 255 (Calculator). G*Power: n1 = 120, n2 = 240

Sample Size Required: With Continuity Correction: n1 = 127, n2 = 254 (Calculator)

Generated with the "PS Power" Program: "We are planning a study of independent cases and controls with 2 control(s) per case. Prior data indicate that the [homosexuality] rate among controls is 0. If the true [homosexuality] rate for experimental subjects is 0.049, we will need to study 127 experimental subjects and 254 control subjects to be able to reject the null hypothesis that the [homosexuality] rates for experimental and control subjects are equal with probability (power) 0.8. The Type I error probability associated with this test of this null hypothesis is 0.05. We will use a continuity-corrected chi-squared statistic or Fisher's exact test to evaluate this null hypothesis." n1 = 127, n2 = 254 (Calculator Download)

Possible Outcomes, assuming a study with n1 = 128, n2 = 256:
  • Possible Study Results: 5 / 128 (3.9%) vs 0 / 256 (0.0%) - Power = 0.88 [Power Calculator] - Fisher One-/Two-Sided: 0.012 [Fisher Exact Tests] - Peto OR Estimate: 20.7 (3.2 - 134.4) [Peto OR Estimate Calculations: DJR Hutchon]
  • Possible Study Results: 4 / 128 (3.1%) vs 0 / 256 (0.0%) - Power = 0.76 - Fisher One-/Two-Sided: 0.012 - Peto OR Estimate: 20.6 (2.5 - 165.8)
  • Possible Study Results: 5 / 128 (3.9%) vs 1 / 256 (0.39%) - Power = 0.73 - Fisher One-/Two-Sided: 0.017 - OR: 10.3 (1.2 - 237.1) [Calculator for ORs & Fisher Exact Test]
  • Possible Study Results: 4 / 128 (3.1%) vs 1 / 256 (0.39%) - Power = 0.57 - Fisher One-/Two-Sided: 0.044 - OR: 8.2 (0.9 - 199.4)
Note: Doubling the living control sample relative to the suicide sample (rather than using same-size samples) would be recommended because at least 3 cases could be expected in the suicide group (3 out of 120, or 4 out of 55, in the previous studies), making 4 or 5 homosexually oriented adolescents possible in a group of 128 adolescent suicides. The larger control sample (n = 256) might then produce at least one homosexually oriented adolescent. If not, the "0" value would nonetheless still support the conclusion that homosexually oriented adolescents are at much greater risk for suicide (an Odds Ratio > 8), as evidenced by the Odds Ratios that would result with only one homosexually oriented adolescent in the control group and 4 or 5 homosexually oriented adolescent suicides. Increasing the size of the living control sample would also be a cost-effective way to obtain more precise estimates.
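One row of the possible-outcomes list can be checked end to end with the same stdlib arithmetic used earlier on this page: exact one-sided Fisher power for the hypothetical 5/128 vs 0/256 scenario, plus its Peto OR estimate (a sketch with variable names of our choosing; the 95% interval assumes normality on the log-odds scale):

```python
from math import comb, exp, sqrt

n1, n2, x_obs = 128, 256, 5   # hypothetical: 5/128 suicides vs 0/256 controls

# Exact one-sided Fisher power: the control count is always 0, so find
# the smallest suicide-group count that rejects, then sum the binomial tail.
def p_one_sided(x):
    return comb(n1, x) / comb(n1 + n2, x)

x_crit = next(x for x in range(1, n1 + 1) if p_one_sided(x) < 0.05)
p1 = x_obs / n1
power = sum(comb(n1, k) * p1 ** k * (1 - p1) ** (n1 - k)
            for k in range(x_crit, n1 + 1))          # ~0.88

# Peto one-step OR for the same 2x2 table
N, m1 = n1 + n2, x_obs
o_minus_e = x_obs - n1 * m1 / N
v = n1 * n2 * m1 * (N - m1) / (N ** 2 * (N - 1))
peto = exp(o_minus_e / v)                            # ~20.7
ci = (exp(o_minus_e / v - 1.96 / sqrt(v)),
      exp(o_minus_e / v + 1.96 / sqrt(v)))           # ~(3.2, 134.4)
```

Both figures match the first bullet above: a power of about 0.88 and a Peto OR estimate of about 20.7 (3.2 - 134.4).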

See above. It must be noted, however, that the (conditional) Fisher exact test was inappropriate for the Shaffer and Renaud studies, given the low event counts and especially the presence of zero counts in the 2x2 cells ("0" homosexually oriented adolescents in both control groups). More appropriate methods - generally unconditional tests, but other tests as well - are explored below; the conclusion is that the Null Hypothesis should have been rejected in both studies, and that homosexually oriented adolescents have been at greater risk for suicide compared to their heterosexual counterparts. This conclusion is further supported by the results of the two studies combined via meta-analysis. What now remains to be determined is the magnitude of their greater risk for suicide, expressed either as a Risk Ratio or an Odds Ratio; having sufficient event counts, especially in the control group, is therefore of great importance for making such calculations possible. At worst, given the above "results" possibilities with "0" counts in the control group, we could nonetheless conclude that the ORs would be at least 8 to 10, with one problem: wide confidence intervals.

References

Aberson CL, Berger DE, Healy MR, Romero VL (2002). An Interactive Tutorial for Teaching Statistical Power. Journal of Statistics Education, 10(3):  
Full Text. Note: Links given in the paper to the Power Tutorial are in error. See: Instructors' Notes for the Power Tutorial.
Abstract: This paper describes an interactive Web-based tutorial that supplements instruction on statistical power. This freely available tutorial provides several interactive exercises that guide students as they draw multiple samples from various populations and compare results for populations with differing parameters (for example, small standard deviation versus large standard deviation). The tutorial assignment includes diagnostic multiple-choice questions with feedback addressing misconceptions, and follow-up questions suitable for grading. The sampling exercises utilize an interactive Java applet that graphically demonstrates relationships between statistical power and effect size, null and alternative populations and sampling distributions, and Type I and II error rates. The applet allows students to manipulate the mean and standard deviation of populations, sample sizes, and Type I error rate.
Alhija FN-A (2009). Effect Size Reporting Practices in Published Articles. Educational and Psychological Measurement, 69(2): 245-265. Abstract.
Abstract: Effect size (ES) reporting practices in a sample of 10 educational research journals are examined in this study. Five of these journals explicitly require reporting ES and the other 5 have no such policy. Data were obtained from 99 articles published in the years 2003 and 2004, in which 183 statistical analyses were conducted. Findings indicate no major differences between the two types of journals in terms of ES reporting practices. Different conclusions could be reached based on interpreting ES versus p values. The discrepancy between conclusions based on statistical versus practical significance is frequently not reported, not interpreted, and mostly not discussed or resolved.
Balkin RS, Sheperis CJ (2011). Evaluating and reporting statistical power in counseling research. Journal of Counseling & Development, 89: 268-272. Abstract.
Abstract: Despite recommendations from the Publication Manual of the American Psychological Association (6th ed.) to include information on statistical power when publishing quantitative results, authors seldom include analysis or discussion of statistical power. The rationale for discussing statistical power is addressed, approaches to using G*Power to report statistical power are presented, and examples for reporting statistical power are provided. [The G*Power Program: Download Page. An easy-to-use java-based program for a post hoc analysis of power and using the Fisher exact one-sided or two-sided test: Download Page. Also available: The "PS Power" program: Download Page.]
Cashen LH, Geiger SW (2004). Statistical Power and the Testing of Null Hypotheses: A Review of Contemporary Management Research and Recommendations for Future Studies. Organizational Research Methods, 7(2): 151-167. Abstract. Full Text.
Abstract: The purpose of this study is to determine how well contemporary management research fares on the issue of statistical power with regard to studies specifically predicting null relationships between phenomena of interest. This power assessment differs from traditional power studies because it focuses solely on studies that offered and tested null hypotheses. A sample of studies containing hypothesized null relationships was taken from five mainstream management journals over the 1990 to 1999 time period. Results of the power assessment suggest that management researchers’ abilities to affirm null hypotheses are low. On average, the power assessment revealed that for those studies that found nonsignificance of results and consequently affirmed their null hypotheses, the actual Type II error rate was nearly 15 times greater than what is advocated in the literature when failing to reject a false null hypothesis [emphasis mine]. Recommendations for researchers proposing and testing formal null hypotheses are also discussed.
Cohen, Jacob (1992). A Power Primer. Psychological Bulletin, 112(1): 155-159. Abstract. Full Text.
Abstract: One possible reason for the continued neglect of statistical power analysis in research in the behavioral sciences is the inaccessibility of or difficulty with the standard material. A convenient, although not comprehensive, presentation of required sample sizes is provided. Effect-size indexes and conventional values for these are given for operationally defined small, medium, and large effects. The sample sizes necessary for .80 power to detect effects at these levels are tabled for 8 standard statistical tests: (1) the difference between independent means, (2) the significance of a product–moment correlation, (3) the difference between independent r s, (4) the sign test, (5) the difference between independent proportions, (6) chi-square tests for goodness of fit and contingency tables, (7) 1-way analysis of variance (ANOVA), and (8) the significance of a multiple or multiple partial correlation.
Freedman KB, Back S, Bernstein J (2001). Sample size and statistical power of randomised, controlled trials in orthopaedics. The Journal of Bone and Joint Surgery. British Volume,  83(3): 397-402. Abstract. Full Text.
Lau C-C, Kuk F (2011). Enough is enough: A primer on power analysis in study designs. The Hearing Journal, 64(4): 30-39. Abstract / Download Page.

Jennions MD, Møller AP (2003). A survey of the statistical power of research in behavioral ecology and animal behavior. Behavioral Ecology, 14(3): 438-445. Abstract / Download Page.
Abstract: We estimated the statistical power of the first and last statistical test presented in 697 papers from 10 behavioral journals. First tests had significantly greater statistical power and reported more significant results (smaller p values) than did last tests. This trend was consistent across journals, taxa, and the type of statistical test used. On average, statistical power was 13–16% to detect a small effect and 40–47% to detect a medium effect. This is far lower than the general recommendation of a power of 80%. By this criterion, only 2–3%, 13–21%, and 37–50% of the tests examined had the requisite power to detect a small, medium, or large effect, respectively [emphasis mine]...
Nickerson RS (2000). Null hypothesis significance testing: a review of an old and continuing controversy. Psychological Methods, 5(2):241-301. Abstract.
Abstract: Null hypothesis significance testing (NHST) is arguably the most widely used approach to hypothesis evaluation among behavioral and social scientists. It is also very controversial. A major concern expressed by critics is that such testing is misunderstood by many of those who use it. Several other objections to its use have also been raised. In this article the author reviews and comments on the claimed misunderstandings as well as on other criticisms of the approach, and he notes arguments that have been advanced in support of NHST. Alternatives and supplements to NHST are considered, as are several related recommendations regarding the interpretation of experimental data. The concluding opinion is that NHST is easily misunderstood and misused but that when applied with good judgment it can be an effective aid to the interpretation of experimental data.
Sedlmeier P, Gigerenzer G (1989). Do studies of statistical power have an effect on the power of studies? Psychological Bulletin, 105: 309-316. Abstract. Full Text.
Abstract: The long-term impact of studies of statistical power is investigated using J. Cohen's (1962) pioneering work as an example. We argue that the impact is nil; the power of studies in the same journal that Cohen reviewed (now the Journal of Abnormal Psychology) has not increased over the past 24 years. In 1960 the median power (i.e., the probability that a significant result will be obtained if there is a true effect) was .46 for a medium size effect, whereas in 1984 it was only .37. The decline of power is a result of alpha-adjusted procedures. Low power seems to go unnoticed: only 2 out of 64 experiments mentioned power, and it was never estimated. Nonsignificance was generally interpreted as confirmation of the null hypothesis (if this was the research hypothesis), although the median power was as low as .25 in these cases. We discuss reasons for the ongoing neglect of power.


"If the author does not obtain significant results in his/her study, the likelihood of being published is severely diminished due to the publication bias that exists for statistically significant results (Begg, 1994). As a result there may be literally thousands of studies with meaningful effect sizes that have been rejected for publication or never submitted for publication. These studies are lost because they do not pass muster with NHST." (Nix & Barnette, 1998: 5-6)

Nix TW, Barnette JJ (1998). The Data Analysis Dilemma: Ban or Abandon. A Review of Null Hypothesis Significance Testing. Research in the Schools, 5(2): 3-14.  Full Text.

Comment

Yet both the Shaffer et al. (1995) and Renaud et al. (2010) studies were published even though their results were not statistically significant. Was there perhaps another form of "bias" that made these studies so easy to publish? Could it be that they were published because they both blatantly accepted the null hypothesis, even if doing so violated protocol? Why was this not detected by peer reviewers or the journal editors?




Effect Size Analysis: Independent Binomial 2x2s With Low Counts and One Zero Event

General Information on the importance of considering the Effect Size



There are many measures available for effect sizes (Cohen, 1988; Sánchez-Meca et al., 2003; Hojat & Xu, 2004; Livingston et al., 2009; Ferguson, 2009), but the effect size measure that seems to work best with low event counts, and especially with a zero in one cell, is Cohen's h (Cohen, 1988), which is based on arcsine transformations. A separate section with references is available on meta-analyses using arcsine differences. Furthermore, a separate webpage - "Generating Cohen's Effect Size "h" Via Arcsin / Arcsine Transformations" - was developed to better explain Cohen's h, the related equations and calculations, and the use of tables in Cohen's 1988 book to estimate h values.
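As a minimal illustration (Python; the helper name is ours, not from Cohen's book), h can be computed directly from the formula h = 2·arcsin(√P1) − 2·arcsin(√P2), and it remains well defined when one proportion is zero:

```python
from math import asin, sqrt

def cohens_h(p1, p2):
    """Cohen's effect size h for two proportions (arcsine transformation)."""
    return 2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2))

# Shaffer et al. (1995): 3/120 vs 0/147; Renaud et al. (2010): 4/55 vs 0/55
h_shaffer = cohens_h(3 / 120, 0 / 147)   # between small (0.2) and medium (0.5)
h_renaud = cohens_h(4 / 55, 0 / 55)      # approximately medium (near 0.5)
print(h_shaffer, h_renaud)
```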


In the table below, the Shaffer et al. (1995) and Renaud et al. (2010) basic study results are given, followed by the results of an "arcsine difference" (AS) based meta-analysis used to combine the two studies. The associated arcsine-related effect sizes - Cohen's h values - are also given.


The Shaffer et al. (1995) & Renaud et al. (2010) Studies:
Arcsine Difference Meta-Analysis *

Study Counts - Homosexual or Related Harassment:
  • Shaffer et al. (1995) - Suicide Group: Yes: 3, No: 117, Total: 120. - Control Group: Yes: 0, No: 147, Total: 147.
  • Renaud et al. (2010) - Suicide Group: Yes: 4, No: 51, Total: 55. - Control Group: Yes: 0, No: 55, Total: 55.

Arcsine Difference (AS) & Cohen's h (Effect Size) Results:
Formulas: AS (Arcsin Difference) = arcsin √P1 - arcsin √P2. - Cohen's h (Arcsin Transformation Related) = 2(arcsin √P1) - 2(arcsin √P2).
  • Shaffer et al. (1995): P1 = 3/120 = 0.025, P2 = 0/147 = 0. - AS = 0.16 (0.04, 0.28), One-/Two-Tailed p = 0.032, 0.057** - h = 0.32 (0.08, 0.56)
  • Renaud et al. (2010): P1 = 4/55 = 0.0727, P2 = 0/55 = 0. - AS = 0.27 (0.09, 0.46), One-/Two-Tailed p = 0.021, 0.043** - h = 0.54 (0.18, 0.92)
  • Two Studies Combined: AS = 0.19 (0.09, 0.30) - h = 0.38 (0.18, 0.60)
Effect Size Magnitude for h: 0.20 (Small) - 0.50 (Medium) - 0.80 (Large)

Meta-Analysis Generated With "Meta-Analysis with R. Version 1.6-1" Developed by Schwarzer (2012). Reference: Fixed / Random Effect(s) Models.
* Reference: Meta-Analysis Using Arcsine Difference.
** Calculated with SMP Program, Version 2.1 http://www.ugr.es/~bioest/software.htm



The arcsin difference (AS) individual study results and the meta-analysis outcome suggest that both the Shaffer and Renaud studies produced statistically significant results. Furthermore, the effect sizes h ranged from small (0.32) to medium (0.54), thus indicating that the results have practical significance/importance.


Above, in the section on Power Analyses, it was noted that both the Shaffer and Renaud studies had low power to detect statistically significant differences, but those calculations were based on the conditional Fisher Exact Test, which is known to be conservative: it may produce a statistically nonsignificant result (e.g., p = 0.070) where a more powerful test would have produced a statistically significant one (e.g., p = 0.040). This issue will be explored in the next section.

References

Champely, Stephane (2009). Basic functions for power analysis [Package ‘pwr’]. "R" Program. PDF Download.

Chen H, Cohen P, Chen S (2010). How Big is a Big Odds Ratio? Interpreting the Magnitudes of Odds Ratios in Epidemiological Studies. Communications in Statistics - Simulation and Computation, 39: 860–864. Abstract.
Our calculations indicate that OR = 1.68, 3.47, and 6.71 are equivalent to Cohen’s d = 0.2 (small), 0.5 (medium), and 0.8 (large), respectively, when disease rate is 1% in the nonexposed group; Cohen’s d < 0.2 when OR < 1.5, and Cohen’s d > 0.8 when OR > 5. It would be useful to values with corresponding qualitative descriptors that estimate the strength of such associations; however, to date there is no consensus as to what those values of OR may be. Cohen (1988) suggested that d = 0.2, 0.5, and 0.8 are small, medium, and large on the basis of his experience as a statistician, but he also warned that these were only “rules of thumb.” Better guidelines are needed to draw conclusions about strength of associations in studies of risks for disease when we use OR as the index of effect size in epidemiological studies. (p. 864)
Coe, Robert (2002). It's the Effect Size, Stupid. What effect size is and why it is important. Paper presented at the Annual Conference of the British Educational Research Association, University of Exeter, England, 12-14 September 2002. Full Text.
Cohen, Jacob (1988). Statistical power analysis for the behavioral sciences. Second Edition. Hillsdale, New Jersey: Lawrence Erlbaum Associates, Inc. Google Books. Amazon. Excerpts on a Related Webpage.

Ferguson CJ (2009). An Effect Size Primer: A Guide for Clinicians and Researchers. Professional Psychology: Research and Practice, 40(5): 532–538. Abstract. PDF Download.

Fritz O, Morris PE, Richler JJ (2012). Effect size estimates: Current use, calculations, and interpretation. Journal of Experimental Psychology: General, 141(1): 2-18. Abstract. PDF Download.

Hojat M, Xu G (2004). A Visitor's Guide to Effect Sizes – Statistical Significance Versus Practical (Clinical) Importance of Research Findings. Advances in Health Sciences Education, 9(3): 241-249. Abstract.

Kelley, Ken (2012). On Effect Size. Psychological Methods. In Press.
PDF Download.

Kotrlik JW, Williams HA, Jabor MK (2011). Reporting and Interpreting Effect Size in Quantitative Agricultural Education Research. Journal of Agricultural Education, 52(1): 132–142. PDF Download.

Livingston EH, Elliot A, Hynan L, Cao J (2009). Effect size estimation: a necessary component of statistical analysis [Editorial]. Archives of Surgery, 144(8): 706-12. PubMed Reference. Full Text.

Nakagawa S, Cuthill IC (2007). Effect size, confidence interval and statistical significance: a practical guide for biologists. Biological Reviews of the Cambridge  Philosophical Society, 82(4): 591-605.
Abstract. PDF Download.

Sánchez-Meca J, Marín-Martínez F, Chacón-Moscoso S (2003). Effect-size indices for dichotomized outcomes in meta-analysis. Psychological Methods, 8(4): 448-67.
Abstract. Full Text.

Schuele CM, Justice LM (2006, August 15). The Importance of Effect Sizes in the Interpretation of Research : Primer on Research: Part 3. The ASHA Leader. Full Text.

Sun S, Pan W, Wang LL (2010). A Comprehensive Review of Effect Size Reporting and Interpreting Practices in Academic Journals in Education and Psychology. Journal of Educational Psychology, 102(4): 989-1004. Abstract.
Full Text.

Texas Education Agency: Best Practice Clearing House (2012). How to Interpret Effect Sizes.
Full Text.

Thompson, Bruce (2002). “Statistical,” “Practical,” and “Clinical”: How Many Kinds of Significance Do Counselors Need to Consider? Journal of Counseling and Development, 80(1): 64-71. Abstract. Full Text.

Valentine JC, Cooper H (2003). Effect size substantive interpretation guidelines: Issues in the interpretation of effect sizes. Washington, DC: What Works Clearinghouse. PDF Download.


"Suicidologists have had a great difficulty in identifying meaningful correlates and predictors of suicidal behavior. Because of this Neuringer and Kolstoe (1966) suggested adopting less stringent criteria for statistical significance in suicide research, perhaps allowing rejection of the null hypothesis at the 10% level instead of the 5% level. This is an intriguing idea which has never been followed up, but it would result in the appearance of a larger proportion of "significant" results that were never replicated." (Rogers & Lester, 2010: Chapter 2: General Methodological Issues, p. 21)

Neuringer C, Kolstoe RH (1966). Suicide research and the nonrejection of the null hypothesis. Perceptual & Motor Skills, 22: 115-118. Summary & First Page Excerpt.


Rogers JR, Lester D (2010). Understanding Suicide: Why We Don't and How We Might. Cambridge, MA: Hogrefe Publishing. Hogrefe Publishing. Amazon. Book Review.

Comment

A Possibility Not Mentioned: Maybe the most commonly used statistical test was flawed in some way. Maybe more powerful tests were available to determine more accurately whether or not the null hypothesis should be rejected?




Problems With The Widely Used Conditional Fisher Exact Test: The More Appropriate Use of Unconditional Tests and Mid-p Values
In the Shaffer et al. and Renaud et al. studies, the widely used Fisher Exact Test was employed. In this section, we demonstrate that this test is unnecessarily conservative and that more appropriate tests are available - tests which would lead to a reversal of the conclusions made by the authors.
An important fact related to the conditional Fisher Exact Test was highlighted in the Hirji et al. (1991) abstract: "The use of the Fisher exact test for comparing two independent binomial proportions has spawned an extensive controversy in the statistical literature. Many critics have faulted this test for being highly conservative."
Being "highly conservative" means that when the conditional Fisher Exact Test produces, for example, p = 0.070, the Null Hypothesis is not rejected and the difference between the two compared groups is deemed not statistically significant - but these conclusions might be in error. That is, if more appropriate statistical tests had been carried out - such as "unconditional exact tests," or tests that approximate unconditional results such as the Mid-p - most "p" results might have fallen in the "statistically significant" category. In other words, the conservatism of the Fisher Exact Test would have been removed, as noted by Agresti & Gottard (2007). With respect to the Shaffer (1995) and Renaud (2010) studies, the one-sided Fisher Exact Test was used, but Lydersen et al. (2009) emphasized the following in both the abstract and conclusion of their paper:
"The traditional Fisher's exact test should practically never be used." The reason for this is given: "Unconditional tests preserve the significance level and generally are more powerful than [the conditional] Fisher's exact test for moderate to small samples, but previously were disadvantaged by being computationally demanding. This disadvantage is now moot, as software to facilitate unconditional tests has been available for years. Moreover, Fisher's exact test with mid-p adjustment gives about the same results as an unconditional test. Consequently, several better tests are available, and the choice of a test should depend only on its merits for the application involved. Unconditional tests and the mid-p approach ought to be used more than they now are." ... Elsewhere in the paper: "We consider exact unconditional tests to be the gold standard for testing association in 2×2 tables."
In other words, in many cases - and especially in studies with two small independent binomial samples - the conditional Fisher Exact Test is less powerful and less exact for determining statistical significance when comparing the two samples. One concern with using the Mid-p value, however, has been that Mid-p statistically significant outcomes might produce more Type I errors (determinations of statistical significance where nonsignificance exists) than the Fisher Exact Test. Concerning this issue, researchers have reported that the problem is either minimal or nonexistent (e.g., Hwang & Yang, 2001; Crans & Shuster, 2008; Parzen, 2009; Biddle & Morris, 2011). Using Mid-p values also reduces the high Type II error rates produced by the Fisher Exact Test, which appears to have been the problem when the Fisher Exact Test was applied to the Shaffer and Renaud study data. Concerning this issue, Biddle and Morris (2011) state:
"In recent years, there has been growing support in the statistical literature regarding an adjustment to the FET [Fisher exact test], namely Lancaster’s mid-P (LMP) test (Lancaster, 1961), which yields a better balance of Type I and Type II errors. The LMP test produces Type I error rates that are generally closer to the nominal significance level (typically set at .05) than the FET, which often shows Type I error rates substantially below the nominal level (Crans & Shuster, 2008). Consequently, the LMP test often has greater statistical power (lower Type II error rates) than the FET. The LMP test has been widely accepted in the medical and statistics fields..." (p. 956)
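The mid-p adjustment is simple to sketch: only half of the observed table's probability enters the tail. Applied to the actual study data (Python, standard library; hypergeometric probabilities computed from binomial coefficients; results may differ slightly in the last digit from the calculator outputs reported elsewhere on this page), it turns both nonsignificant one-sided Fisher p-values into significant mid-p values:

```python
from math import comb

def fisher_and_midp(a, n1, b, n2):
    """One-sided Fisher exact p-value and its mid-p adjustment for a/n1 vs b/n2."""
    m, n = a + b, n1 + n2
    pmf = lambda k: comb(n1, k) * comb(n2, m - k) / comb(n, m)
    tail = sum(pmf(k) for k in range(a, min(m, n1) + 1))
    return tail, tail - 0.5 * pmf(a)   # mid-p: half-weight on the observed table

for name, a, n1, b, n2 in [("Shaffer", 3, 120, 0, 147), ("Renaud", 4, 55, 0, 55)]:
    p, midp = fisher_and_midp(a, n1, b, n2)
    print(f"{name}: Fisher one-sided p = {p:.3f}, mid-p = {midp:.3f}")
```

With a "0" cell, only one table is more extreme than none, so the mid-p is exactly half the Fisher p - which is why the adjustment matters so much in these two studies.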
Fellows (2010) presents the best support to date for using mid-p values:
"Although the mid-p was developed over 50 years ago, practitioners have yet to adopt it into everyday practice. Because the mid-p can have Type I error greater than its nominal level, it could possibly be viewed as deceptive (Routledge, 1994). This view is understandable considering the lack of solid theoretical grounding. The mid-p has, with some exceptions (Lancaster, 1961; Hwang and Yang, 2001), been justified using heuristic devices (Barnard, 1989; Berry and Armitage, 1995) and numerical simulation (Hirji, 1991). While these justifications do provide some comfort that the mid-p is safe, one would be correct to be somewhat uneasy with its wholesale application to all problems. Under estimated truth framework, the mid-p has a solid claim to primacy. The minimax theorems developed in this article show that in the worst-case scenarios, the mid-p is the least risky p-value. These results apply to a surprisingly wide range of situations. In the case of a simple hypothesis test, the mid-p and the likelihood ratio ordering were found to be mutually supportive of one another. In the case of one-sided tests, the mid p-value was shown to dominate all other members of its family. The mid p-value has smaller maximum risk than any of the more conservative members of its family in the case of two sided tests, and dominates all of its family in some important special cases that pervade statistical practice." (p. 251)
It seems, however, that the unconditional Barnard's Exact Test (1945, 1947; Related Section of this Page) is the ultimate test for significance when comparing two independent binomials (Trujillo-Ortiz, 2004; Mehta & Senchaudhuri, 2003; Martin Andres & Silva Mato, 1994; Camilli, 1990). The major problem with the Barnard Exact Test has been its need for extensive computing, now resolved - at least for small n's in 2x2 tables - by more powerful computers. Computing time nonetheless remains an issue, depending on one's computer and on increasing n's and table sizes.
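To illustrate how an unconditional test removes the conservatism, here is a sketch of Boschloo's test, a related unconditional test that uses the one-sided Fisher p-value as its ordering statistic and maximizes over the nuisance parameter (Barnard's original test orders tables differently and is far more computationally demanding; this sketch is ours, not the software used for the published analyses). Applied to the Renaud data, the unconditional p-value comes out no larger than the conditional Fisher value:

```python
from math import comb

def fisher_one_sided(a, n1, b, n2):
    """One-sided Fisher exact p-value for a/n1 vs b/n2."""
    m, n = a + b, n1 + n2
    if m == 0:
        return 1.0
    return sum(comb(n1, k) * comb(n2, m - k)
               for k in range(a, min(m, n1) + 1)) / comb(n, m)

def boschloo(a, n1, b, n2, grid=200):
    """One-sided Boschloo unconditional test for a/n1 vs b/n2.

    Rejection region: all tables (i, j) whose one-sided Fisher p-value is
    <= that of the observed table. The reported p-value is the maximum,
    over a grid of nuisance-parameter values p, of the probability of
    landing in that region under two independent binomials with rate p.
    """
    p_obs = fisher_one_sided(a, n1, b, n2)
    region = [(i, j) for i in range(n1 + 1) for j in range(n2 + 1)
              if fisher_one_sided(i, n1, j, n2) <= p_obs + 1e-12]
    best = 0.0
    for g in range(1, grid):
        p = g / grid
        b1 = [comb(n1, i) * p**i * (1 - p)**(n1 - i) for i in range(n1 + 1)]
        b2 = [comb(n2, j) * p**j * (1 - p)**(n2 - j) for j in range(n2 + 1)]
        best = max(best, sum(b1[i] * b2[j] for i, j in region))
    return best

# Renaud et al. (2010): 4/55 suicides vs 0/55 controls
print(boschloo(4, 55, 0, 55))
```

Because the Fisher p-value is itself a valid (super-uniform) p-value, Boschloo's p-value can never exceed it, which is the formal sense in which the unconditional test is uniformly more powerful.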

Finally, Fisher's Exact Test is designed for a probability model different from the one needed here. Fisher's Exact Test corresponds to the "tea-tasting" probability model, i.e., one where both margins of the 2x2 table are fixed. This deviates from the sampling scheme involved in the Shaffer et al. and Renaud et al. studies, where sampling is from two independent binomials, i.e., only one margin of the 2x2 table is fixed (see, e.g., Lydersen et al., 2009).
In the "Statistical Significance Tests" table, results from varied unconditional tests, Mid-p analyses, and other tests (e.g., the Arcsine Difference) are given for both the Shaffer and Renaud studies. In the great majority of cases (and more so for one-sided tests), the results generated from unconditional tests, Mid-p analyses, and other tests indicate statistical significance for the difference between the two independent binomial samples in both studies (meaning that the Null Hypothesis is to be rejected). On this basis, it is therefore concluded that homosexually oriented adolescents are likely at greater risk for suicide compared to their heterosexual counterparts. Parzen (2009) reports that Mid-p values also closely approximate the result of a Bayesian analysis, which is what was observed in the Bayesian analysis and meta-analysis of the two studies, as summarized below.


A Bayesian Analysis and Meta-Analysis of the Shaffer et al. (1995) and Renaud et al. (2010) Studies.
The meta-analysis results in this section are part of the Plöderl et al. (2013) paper published in the Archives of Sexual Behavior.


The Shaffer et al. (1995) & Renaud et al. (2010) Studies:
Suicide Risk & Adolescent Homosexuality

Shaffer et al. (1995)
Homosexuality As Suicide Risk Factor: 3 suicides, all males = 3.2% of males, 2.5% of the sample. Result reported to be not significant: Fisher Exact Test, One-Sided: p = 0.088.
Study Information: Psychological autopsy: 170 consecutive New York City suicide victims - 1984-1986 - under the age of 20, with 120 available for study: 95 males and 25 females. Not one female was deemed to be gay/bisexual or to have had sex with a female. 3 males were deemed to be homosexual: one had acknowledged his gay orientation, and 2 had been homosexually active. 1 male, not classified as homosexual, committed suicide with one of the homosexual male victims; both were found dead holding hands, yet he was classified only as a friend of gay teenagers. 5 other males were "known to be close friends of other gay teenagers." 3 males "reported to have been effeminate in their behavior" were not classified as homosexual, but the only openly gay suicide victim was effeminate. There were 83 males without information about their sexual orientation - nothing that could lead researchers to suspect they were gay/bisexual or had had homosexual experiences - and they were considered heterosexual unless proven otherwise. There may have been undisclosed homosexual individuals among these 83 males, thus biasing the actual group difference downwards. The 147 male and female controls were obtained from a random sample of 196, with 49 having refused to participate; there were 116 males in the control group. The method used to discover whether a suicide victim was gay or bisexual, and/or had ever had same-gender sex, consisted of asking: 1. "a parent or other adult member of the household in which the victim was living at the time of death;" 2. "either a sibling or a friend from the victim's peer group nominated by the parent or caretaker;" 3. "at least one school teacher (and, more usually three) nominated by the school principal as being well informed about the subject's classroom behavior." Caveat: low counts. Also, with "0" individuals classified as homosexual in the control group, it is impossible to calculate an Odds Ratio; the control group should therefore have been larger to make this calculation possible.

Renaud et al. (2010)
Homosexuality As Suicide Risk Factor: 4 homosexual suicides out of 55 suicide victims. Result reported to be not significant (as above): Fisher Exact Test, One-Sided: p = 0.061.
Study Information: Canadian (Quebec) youth suicide victims (n = 55, 2000 to 2003, 11 to 18 years old, 43 males, 12 females), 4 of whom (3 males, 1 female) were deemed to be homosexual. Control subjects (n = 55), none deemed to be homosexual. "Within the suicide victims, 4 people were found to have had a same-sex sexual experience, described themselves as having same-sex sexual orientation, or expressed concern regarding their sexual orientation (3 males and 1 female)." Caveat: low counts. Also, with "0" individuals classified as homosexual in the control group, it is impossible to calculate an Odds Ratio; the control group should therefore have been larger to make this calculation possible.

Combining the Shaffer et al. (1995) & Renaud et al. (2010) Studies
The two studies combined via meta-analysis, and the results from Bayesian analyses, produce statistically significant group differences. Results reported in Plöderl et al. (2013): Figure 1, below.

Statistical calculations for the Fisher Exact Test, one-sided, carried out at: http://statpages.org/ctab2x2.html . For other statistical tests of significance for both studies, see the table below.



Plöderl M, Wagenmakers EJ, Tremblay P, Ramsay R, Kralovec K, Fartacek C, Fartacek R (2013). Suicide Risk and Sexual Orientation: A Critical Review. Archives of Sexual Behavior, 42(5): 715-727. PubMed Abstract.

Figure 1. From: Plöderl et al. (2013)

Note: The combined Shaffer & Renaud study results are used in a book chapter by Wetzels et al. (in press) to illustrate how a Bayesian analysis proceeds.

Wetzels, R., van Ravenzwaaij, D., & Wagenmakers, E.-J. (in press). Bayesian analysis. In R. Cautin, & S. Lilienfeld (Eds.), The Encyclopedia of Clinical Psychology. Wiley-Blackwell. Reference. PDF Download.



Statistical Significance Testing

Given the statistical significance results tabulated below for one-sided tests, and given that unconditional tests are more appropriate for low-count 2 x 2 tables with rare events, the Null Hypothesis should be rejected in both studies. That is, we can conclude that homosexually oriented adolescents are at greater risk for suicide compared to other adolescents. The magnitude of this difference remains uncertain; the best available OR estimates, derived using the Peto method, are 9.72 (0.99 - 95.62) for the Shaffer study and 7.82 (1.07 - 57.06) for the Renaud study. Information on conditional and unconditional statistical significance tests is given below the table.
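The Peto estimate cited above is the standard one-step (Yusuf-Peto) odds ratio, which, unlike the ordinary sample OR, remains defined when one cell is zero. The Renaud et al. (2010) value can be reproduced from the 4/55 vs. 0/55 counts with a minimal stand-alone sketch (Python standard library only; the function name is ours):

```python
from math import exp, sqrt

def peto_odds_ratio(a, b, c, d, z=1.96):
    """Peto one-step odds ratio for the 2x2 table [[a, b], [c, d]].

    OR = exp((O - E) / V), where O is the observed event count in
    group 1, E its null expectation with all margins fixed, and V
    the corresponding hypergeometric variance. Returns the OR and
    its 95% confidence limits; defined even with a zero cell.
    """
    n1, n2 = a + b, c + d            # group sizes
    m = a + c                        # total events
    N = n1 + n2
    O, E = a, n1 * m / N
    V = n1 * n2 * m * (N - m) / (N**2 * (N - 1))
    log_or = (O - E) / V
    half_width = z / sqrt(V)
    return exp(log_or), exp(log_or - half_width), exp(log_or + half_width)

# Renaud et al. (2010): 4/55 suicide victims vs. 0/55 controls
or_, lo, hi = peto_odds_ratio(4, 51, 0, 55)
print(round(or_, 2), round(lo, 2), round(hi, 2))  # 7.82 1.07 57.06
```

The wide confidence interval (1.07 to 57.06) reflects the very low counts: the direction of the effect is clear, but its magnitude is poorly pinned down, as noted above.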



Statistical Significance / p-Value Tests
Shaffer et al. (1995) & Renaud et al. (2010) Studies

[Table: Statistical Significance Tests for the Shaffer and Renaud Studies]

1Calculated with function fisher.test in R, similar results with http://www.quantitativeskills.com/sisa/statistics/fisher.htm : The Fisher exact test.
2Calculated in R with the function given here: http://www.r-statistics.com/wp-content/uploads/2010/02/Barnard.R.txt.
3Calculated with http://www.quantitativeskills.com/sisa/statistics/fisher.htm : The Fisher exact test.
4Calculated with WinPepi 11.17 - Download at http://www.brixtonhealth.com/.
5Only recommended if cell frequencies are not less than 1.
6Calculated with the SMP Program, Version 2.1: http://www.ugr.es/~bioest/software.htm
7Calculated with http://www.stat.ncsu.edu/exact/ : Berger (1996-2005).
8Confidence Interval set to 0.999, as recommended.
9Calculated with the Barnard package in R, resolution factor for the nuisance parameter set to dp = 0.0001
10According to Martín Andrés (personal communication, January, 2012), there is no other software solution to run the original Barnard test.
Note: The Log-Likelihood χ2 can also be calculated at http://www.quantitativeskills.com/sisa/statistics/fisher.htm : 4.84 (p = 0.0278) for Shaffer et al. (1995), and 5.696 (p = 0.017) for Renaud et al. (2010).
Significance Tests Calculated by Plöderl & Tremblay
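The Log-Likelihood χ2 quoted in the note above for Renaud et al. (2010) can be reproduced directly. This is a minimal stand-alone sketch (Python standard library only; function names are ours), using the zero-cell convention 0·ln(0/E) = 0 and the 1-df χ2 tail probability via the complementary error function:

```python
from math import log, sqrt, erfc

def log_likelihood_chi2(a, b, c, d):
    """Log-likelihood ratio statistic G^2 = 2 * sum O * ln(O / E)
    for the 2x2 table [[a, b], [c, d]]; empty cells contribute 0."""
    n = a + b + c + d
    g2 = 0.0
    for obs, row, col in [(a, a + b, a + c), (b, a + b, b + d),
                          (c, c + d, a + c), (d, c + d, b + d)]:
        expected = row * col / n
        if obs > 0:
            g2 += 2 * obs * log(obs / expected)
    return g2

def chi2_sf_1df(x):
    """Upper tail probability of the chi-squared distribution, 1 df."""
    return erfc(sqrt(x / 2))

# Renaud et al. (2010): 4/55 suicide victims vs. 0/55 controls
g2 = log_likelihood_chi2(4, 51, 0, 55)
print(round(g2, 3), round(chi2_sf_1df(g2), 3))  # 5.696 0.017
```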


References, Excerpts & Abstracts: A Basic Education on Binomial Null Hypothesis Statistical Testing


A Good Summary of Recommended Statistical Significance Tests for 2X2 Tables:
  • Lydersen S, Fagerland MW, Laake P (2009). Tutorial in biostatistics: Recommended tests for association in 2×2 tables. Statistics in Medicine, 28(7): 1159-1175. Abstract. PDF Download.

Abstract: The asymptotic Pearson's chi-squared test and Fisher's exact test have long been the most used for testing association in 2X2 tables. Unconditional tests preserve the significance level and generally are more powerful than Fisher's exact test for moderate to small samples, but previously were disadvantaged by being computationally demanding. This disadvantage is now moot, as software to facilitate unconditional tests has been available for years. Moreover, Fisher's exact test with mid-p adjustment gives about the same results as an unconditional test. Consequently, several better tests are available, and the choice of a test should depend only on its merits for the application involved. Unconditional tests and the mid-p approach ought to be used more than they now are. The traditional Fisher's exact test should practically never be used (Emphasis added).

Recommendations: Exact tests have the important property of always preserving test size. Our general recommendation is not to condition on any marginals not fixed by design. In practice, this means that an exact unconditional test is ideal. Pearson’s chi-squared (z-pooled) statistic or Fisher–Boschloo’s statistic works well with an exact unconditional test. Further, such a test can be approximated by an exact conditional mid-p test or, in large samples, by the traditional asymptotic Pearson’s chi-squared test. However, when an exact test is chosen, an unconditional test is clearly recommended. The traditional Fisher’s exact test should practically never be used (p. 1174, Emphasis added).


More on Statistical Significance Test Recommendations for 2X2 Tables and Two Independent Binomial Proportions:

  • Mehrotra DV, Chan IS, Berger RL (2003). A cautionary note on exact unconditional inference for a difference between two independent binomial proportions. Biometrics, 59(2): 441-50. PDF Download. Abstract.

Abstract: Fisher's exact test for comparing response proportions in a randomized experiment can be overly conservative [Many Type II Errors] when the group sizes are small or when the response proportions are close to zero or one. This is primarily because the null distribution of the test statistic becomes too discrete, a partial consequence of the inference being conditional on the total number of responders. Accordingly, exact unconditional procedures have gained in popularity, on the premise that power will increase because the null distribution of the test statistic will presumably be less discrete. However, we caution researchers that a poor choice of test statistic for exact unconditional inference can actually result in a substantially less powerful analysis than Fisher's conditional test. To illustrate, we study a real example and provide exact test size and power results for several competing tests, for both balanced and unbalanced designs. Our results reveal that Fisher's test generally outperforms exact unconditional tests based on using as the test statistic either the observed difference in proportions, or the observed difference divided by its estimated standard error under the alternative hypothesis, the latter for unbalanced designs only. On the other hand, the exact unconditional test based on the observed difference divided by its estimated standard error under the null hypothesis (score statistic) outperforms Fisher's test, and is recommended. Boschloo's test, in which the p-value from Fisher's test is used as the test statistic in an exact unconditional test, is uniformly more powerful than Fisher's test, and is also recommended.

"To circumvent the conservatism of exact conditional inference based on Fisher's test, Barnard (1945, 1947) proposed the use of exact unconditional inference, based on elimination of the nuisance parameter by maximization. However, invoking Fisher's principle of ancillarity (see Basu, 1977; Little, 1989), Barnard (1949) subsequently renounced his unconditional test. Despite his renouncement and other publications in support of Fisher's test (Yates, 1984; Barnard, 1989, Upton, 1992), exact unconditional inference has gained in popularity over the last few decades (Berkson, 1978; Kempthorne, 1979; Santner and Snell, 1980; Upton, 1982; Suissa and Shuster, 1985; Haber, 1986; D'Agostino, Chase, and Belanger, 1988; Rice, 1988; Haviland, 1990; Storer and Kim, 1990; Andres and Mato, 1994)." (p. 441)

Basu D (1977). On the elimination of nuisance parameters. Journal of the American Statistical Association, 72: 355-366. Summary & Page 355.
Berkson J (1978). In dispraise of the exact test: Do the marginal totals of the 2X2 table contain relevant information respecting the table proportions? Journal of Statistical Planning and Inference, 2(1): 27-42. Abstract.
D'Agostino RB, Chase W, Belanger A (1988). The appropriateness of some common procedures for testing the equality of two independent binomial populations. American Statistician, 42: 198-202.
Summary & Page 198.
Haviland MG (1990). Yates's correction for continuity and the analysis of 2 x 2 contingency tables (with comments). Statistics in Medicine, 9: 363-383. Abstract.
Kempthorne O (1979). In dispraise of the exact test: reactions. Journal of Statistical Planning and Inference, 3: 199-213.
Little RJA (1989). Testing the equality of two independent binomial proportions. American Statistician, 43: 283-288.
Summary & Page 283.
Rice WR (1988). A new probability model for determining exact p-values for contingency tables when comparing binomial proportions. Biometrics, 44: 1-22.
Abstract. PDF Download. Comment & Author Reply. Comment & Author Reply.
Santner TJ, Snell MK (1980). Small-Sample Confidence Intervals for p1 - p2 and p1/p2 in 2 × 2 Contingency Tables. Journal of the American Statistical Association, 75, No. 370: 386-394.
Summary & Page 386.
Storer BE, Kim C (1990). Exact properties of some exact test statistics for comparing two binomial proportions. Journal of the American Statistical Association, 85: 146-155. Summary & Page 146.
Upton G (1982). A Comparison of Alternative Tests for the 2 x 2 Comparative Trial. Journal of the Royal Statistical Society, Series A, 145: 86-105.
Summary & Page 86.
Yates F (1984). Tests of Significance for 2 × 2 Contingency Tables. Journal of the Royal Statistical Society. Series A, 147(3): 426-463. Summary & Page 426.



A Good Explanation of Why the Barnard Exact Test Should be Used Instead of the Fisher Exact Test When Comparing Results of Two Small Independent Binomial Samples:

  • Mehta CR, Senchaudhuri P (2003). Conditional versus Unconditional Exact Tests for Comparing Two Binomials.  PDF Download. Download Page.
The authors highlight, also with the use of graphics, that the Fisher test is "conditional," meaning that "the sample space for Fisher’s exact test [is] much more discrete than it is for [the unconditional] Barnard’s exact test. Consequently, the number of distinct p-values that one could obtain with Fisher’s exact test is less than the corresponding number of distinct p-values that one could obtain with Barnard’s exact test [shown graphically]. This in turn implies that if we want to restrict the type-1 error to some upper limit, say 5%, Fisher’s procedure will usually be more conservative than Barnard’s, resulting in a loss of power [For the example given, one-tailed Fisher exact p-value = 0.0641, Barnard's test p-value = 0.0341]. The power loss diminishes as the sample sizes get larger since the discreteness of the Fisher statistic is not as pronounced."

The example given: 15 subjects receive a vaccine, 7 become infected and 8 do not; 15 subjects receive a placebo 'vaccine', 12 become infected and 3 do not.
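The unconditional approach can be sketched for this example. The version below is Boschloo's variant (the one-sided Fisher p-value used as the test statistic, as recommended by Mehrotra et al. and Lydersen et al.), not Barnard's original CSM construction: the rejection region's probability is maximized over the nuisance parameter on a grid. A minimal stand-alone Python sketch; function names are ours:

```python
from math import comb

N1 = N2 = 15  # group sizes (vaccine, placebo)

def fisher_p(x1, x2):
    """One-sided Fisher p-value P(X <= x1) for the table with x1 of N1
    and x2 of N2 infected (direction: fewer infections under vaccine)."""
    m = x1 + x2
    total = comb(N1 + N2, m)
    return sum(comb(N1, k) * comb(N2, m - k)
               for k in range(max(0, m - N2), x1 + 1)) / total

def boschloo_p(x1, x2, grid=2000):
    """Exact unconditional (Boschloo) test: the rejection region is all
    tables whose one-sided Fisher p-value is <= the observed one; the
    p-value is the supremum of this region's probability over the
    common nuisance probability pi, approximated on a grid."""
    p_obs = fisher_p(x1, x2)
    region = [(a, b) for a in range(N1 + 1) for b in range(N2 + 1)
              if fisher_p(a, b) <= p_obs + 1e-12]
    best = 0.0
    for i in range(1, grid):
        pi = i / grid
        prob = sum(comb(N1, a) * pi**a * (1 - pi)**(N1 - a) *
                   comb(N2, b) * pi**b * (1 - pi)**(N2 - b)
                   for a, b in region)
        best = max(best, prob)
    return best

p_f = fisher_p(7, 12)      # ~0.0641 (conditional Fisher)
p_b = boschloo_p(7, 12)    # ~0.034  (unconditional, less conservative)
print(round(p_f, 4), round(p_b, 4))
```

By construction the Boschloo p-value can never exceed the Fisher p-value, which is why it is described below as uniformly more powerful; for this table it moves the result from above to below the 0.05 level, exactly the pattern the authors illustrate.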

Related Calculations:
At SISA: the one-sided Fisher exact test: p = 0.064068 = 0.0641. Mid-p for the one-sided Fisher exact test: p = 0.0372689 = 0.0373.

At "Exact Unconditional Homogeneity/Independence Tests for 2X2 Tables": One-sided Fisher's exact conditional p-value = 0.0641. - One-sided Fisher-Boschloo, Confidence Interval Method (p = 0.0351), No Confidence Interval Method (p = 0.0341). Calculations for other p-value estimates are possible.


Using the downloaded SMP Program, Version 2.1 (http://www.ugr.es/~bioest/software.htm): Mid-p for the one-sided Fisher exact test: p = 0.03411. Unconditional Arc Sine Statistic: p = 0.03411. Barnard's Test, real p-value = 0.03669. Barnard's Test, estimated p-value = 0.03669.
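The mid-p adjustment itself is simple to reproduce: it subtracts half the probability of the observed table from the ordinary Fisher p-value, i.e. it replaces P(X <= x_obs) by P(X < x_obs) + 0.5 P(X = x_obs). A minimal stand-alone sketch for the vaccine example above (Python standard library only; the function name is ours):

```python
from math import comb

def fisher_and_midp(x1, n1, x2, n2):
    """One-sided Fisher exact p-value P(X <= x1) and its mid-p version
    P(X < x1) + 0.5 * P(X = x1), for x1 of n1 vs. x2 of n2 events."""
    m = x1 + x2
    total = comb(n1 + n2, m)

    def pmf(k):  # hypergeometric probability of k events in group 1
        return comb(n1, k) * comb(n2, m - k) / total

    p = sum(pmf(k) for k in range(max(0, m - n2), x1 + 1))
    mid_p = p - 0.5 * pmf(x1)
    return p, mid_p

# Vaccine example: 7/15 infected (vaccine) vs. 12/15 infected (placebo)
p, mid_p = fisher_and_midp(7, 15, 12, 15)
print(round(p, 6), round(mid_p, 6))  # 0.064068 0.037269
```

This reproduces the SISA values quoted above (0.064068 and 0.0372689) exactly; the computational effort is no greater than for the Fisher test itself, as the Hirji et al. (1991) abstract below notes.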

Barnard's Test for Two Independent Binomials & Mid-P Values:

  • Silva Mato A, Martín Andrés A (2000). Software SMP, Version 2.1. http://www.ugr.es/~bioest/software.htm - http://www.ugr.es/~bioest/smp.htm. According to Martín Andrés (personal communication, January, 2012), there is no other software solution to run the original Barnard test. The program generates results for the unconditional Barnard Test: Real-p & Estimated p. Also for: the z-pooled statistic, Fisher one-sided, Fisher mid-p, and the unconditional arc sine statistic. References given:
Martín Andrés A (1991). A review of classic non-asymptotic methods for comparing two proportions by means of independent samples. Communications in Statistics - Simulation and Computation, 20(2/3): 551-583. Abstract.
Martín Andrés A (1997). Entry 'Fisher's exact and Barnard's tests'. Encyclopedia of Statistical Sciences. Update Volume 2, 250-8. Ed.: Kotz, Johnson and Read. Wiley-Interscience.
Martín Andrés A, Herranz Tejedor I (1995). Is Fisher's exact test very conservative? Computational Statistics and Data Analysis, 19, 579-591. Abstract. PDF Download.
Martín Andrés A, Silva Mato A (1994). Choosing the optimal unconditioned test for comparing two independent proportions. Computational Statistics and Data Analysis, 17(5): 555-574. Abstract.
Martín Andrés A, Sánchez Quevedo MJ, Silva Mato A (1998). Fisher's mid-p-value arrangement in 2x2 comparative trials. Computational Statistics and Data Analysis, 29(1), 107-115. Abstract.
Silva Mato A, Martín Andrés A (1995). Optimal unconditional tables for comparing two independent proportions. Biometrical Journal, 37(7), 821-836. Abstract.
Silva Mato A, Martín Andrés A (1997). Simplifying the calculation of the P-value for Barnard's test and its derivatives. Statistics and Computing, 7(2): 137-143. Abstract.
Silva Mato A, Martín Andrés A (1997). SMP.EXE in http://www.jiscmail.ac.uk/files/EXACT-STATS.
  • Barnard, George A (1945). A new test for 2×2 tables. Nature, 156 (No. 3954): 177. Reference. Response to RA Fisher: Nature, 156 (No. 3974):  783-784. Reference. Letters between RA Fisher & GA Barnard (1945-1962): PDF Download.
  • Barnard, George A (1947). Significance tests for 2 × 2 tables. Biometrika, 34(1/2): 123–138. Reference & Introduction. Letters between RA Fisher & GA Barnard (1945-1962): PDF Download.
  • Barnard, George A (1949). Statistical Inference. Journal of the Royal Statistical Society B, 11(2): 115–139. Reference & Introduction. Letters between RA Fisher & GA Barnard (1945-1962): PDF Download.
  • Trujillo-Ortiz, Antonio  (2004). Barnardextest: Barnard's Exact Probability Test. MATLAB CENTRAL. Full Text.
"This file, as the Fisher's exact test, performs the exact probability test for a table of frequency data cross-classified according to two categorical variables, each of which has two levels or subcategories (2x2). It is a non-parametric statistical test used to determine if there are nonrandom associations between the two categorical variables. Barnard's exact test is used to calculate an exact P-value with small number of expected frequencies, for which the Chi-square test is not appropriate (in case the total number of observations is less than 20 or the number of frequency cells are less than 5). The test was proposed by G. A. Barnard in two papers (1945 and 1947). While Barnard's test seems like a natural test to consider, it's not at all commonly used. This probably due that it is a little unknown. Perhaps due to its computational difficulty it is not widely used until recently, where the computers make it feasible. It is considering that the Barnard's exact test is more powerful than the Fisher's one..." [Note: This is likely not the true Barnard Test. Even StatXact had claimed to offer the Barnard Test, which is described as follows by Lydersen et al. (2009): "Barnard’s unconditional test [20] uses a more computationally intensive algorithm for building a rejection region, and is, to our knowledge not included in any available software. StatXact provides the Suissa and Shuster test (somewhat misleadingly named Barnard’s test in StatXact)." (p. 1166). Cytel (2008: http://www.cytel.com/software/StatXact.aspx) reports that "StatXact®9 [is] The Most Popular Exact Statistics Analysis Software... Only StatXact® has: ... exact power and sample size for comparing two binomials by Barnard's unconditional exact test (more powerful than Fisher's test)." See related information by Martín Andrés (2012) at Silva Mato & Martín Andrés (2000).]
  • Lydersen S, Fagerland MW, Laake P (2009). Tutorial in biostatistics: Recommended tests for association in 2×2 tables. Statistics in Medicine, 28(7): 1159-1175. Abstract. PDF Download.
"Andres et al. [34] compared 15 test statistics, including the original test by Barnard. Barnard’s test and a simplified version of Barnard’s test have highest power, but are considered too computer intensive for practical use. Among the others, Pearson’s chi-squared and Fisher–Boschloo have power nearly as high as the optimal Barnard’s test [34]."
  • Cardillo, Giuseppe (2009/2010). MyBernard: A very compact and fast routine for Barnard's exact test on 2x2 matrix. MATLAB CENTRAL. Full Text.
"There are two fundamentally different exact tests for comparing the equality of two binomial probabilities – Fisher’s exact test (Fisher, 1925), and Barnard’s exact test (Barnard, 1945). Fisher’s exact test (Fisher, 1925) is the more popular of the two. In fact, Fisher was bitterly critical of Barnard’s proposal for esoteric reasons that we will not go into here. For 2 × 2 tables, Barnard’s test is more powerful than Fisher’s, as Barnard noted in his 1945 paper, much to Fisher’s chagrin..." [Note: As with the Barnardextest file above, this is likely not the true Barnard Test; see the Lydersen et al. (2009, p. 1166) comment on StatXact's mislabeled Suissa and Shuster test, and the related information by Martín Andrés (2012) at Silva Mato & Martín Andrés (2000).]
  • Hasselblad V, Lokhnygina Y (2007). Tests for 2 × 2 Tables in Clinical Trials. Journal of Modern Applied Statistical Methods, 6(2): 456-468. PDF Download.
Abstract: Five standard tests are compared: chi-squared, Fisher's exact, Yates’ correction, Fisher’s exact mid-p, and Barnard’s. Yates’ is always inferior to Fisher’s exact. Fisher’s exact is so conservative that one should look for alternatives. For certain sample sizes, Fisher’s mid-p or Barnard’s test maintain the nominal alpha and have superior power. [Note: To generate the Barnard Test result, it is likely that the referenced Cytel's StatXact 7 (2005) program was used. From Lydersen et al. (2009): "Barnard’s unconditional test [20] uses a more computationally intensive algorithm for building a rejection region, and is, to our knowledge not included in any available software. StatXact provides the Suissa and Shuster test (somewhat misleadingly named Barnard’s test in StatXact)." (p. 1166).]
  • Martín Andrés A, Sánchez Quevedo MJ, Silva Mato A (2002). Asymptotical tests in 2x2 comparative trials (unconditional approach). Computational Statistics and Data Analysis, 40(2), 339-354. PDF Download.
Abstract: The unconditional Barnard’s test for the comparison of two independent proportions is difficult to apply even with moderately large samples. The alternative is to use a  χ2 type, arc sine or mid-p asymptotic test. In the paper, the authors evaluate some 60 of these tests, some new and others that are already familiar. For the ordinary significances, the optimal tests are the arc sine methods (with the improvement proposed by Anscombe), the  χ2 ones given by Pearson (with a correction for continuity of 2 or of 1 depending on whether the sample sizes are equal or different) and the mid-p-value ones given by Fisher (using the criterion proposed by Armitage, when applied as a two-tailed test). For one-(two) tailed tests, the first method generally produces reliable results E > 10.5 (E > 9 and unbalanced samples), the second method does so for E > 9 (E > 6) and the third does so for all cases, although for E <= 6 (E <= 10.5) it usually gives too many conservative results. E refers to the minimum expected quantity.

"The use of Fisher’s mid-p-value as an approximation to the Barnard test is quite surprising and should be justified. Given that the Fisher exact test is very conservative (compared to Barnard’s test), Plackett, in his discussion of Yates (1984), proposed Fisher’s mid-p-value as a means of reducing its conservatism. The idea was favourably received by Barnard (1989), Routledge (1992), Upton (1992) and Agresti (2001) because it was a way of terminating the conditional vs. unconditional argument (Haber, 1992). Haber (1986) was the first to propose mid-p as an approximation to the unconditional test, one that was described by Hirji et al. (1991) as a quasi-exact test. Both the authors and Davis (1993) agree that mid-p is generally conservative, but quite less so than Fisher’s exact test, and behaves in a very similar fashion to the χ2 test without c.c. Note that although one needs to use a computer to apply the mid-p, actually obtaining it presents no problem (no matter what the value of ni may be)." (p. 341)
  • Röhmel J, Mansmann U (1999). Unconditional non-asymptotic one-sided tests for independent binomial proportions when the interest lies in showing non-inferiority and/or superiority. Biometrical Journal, 41: 149–170. Abstract.
Abstract: For two independent binomial proportions Barnard (1947) has introduced a method to construct a non-asymptotic unconditional test by maximisation of the probabilities over the ‘classical’ null hypothesis H0 = {(θ1, θ2) ∈ [0, 1]²: θ1 = θ2}. It is shown that this method is also useful when studying test problems for different null hypotheses such as, for example, shifted null hypotheses of the form H0 = {(θ1, θ2) ∈ [0, 1]²: θ2 ≤ θ1 ± Δ} for non-inferiority and 1-sided superiority problems (including the classical null hypothesis with a 1-sided alternative hypothesis). We will derive some results for the more general ‘shifted’ null hypotheses of the form H0 = {(θ1, θ2) ∈ [0, 1]²: θ2 ≤ g(θ1)} where g is a non-decreasing curvilinear function of θ1. Two examples for such null hypotheses in the regulatory setting are given. It is shown that the usual asymptotic approximations by the normal distribution may be quite unreliable. Non-asymptotic unconditional tests (and the corresponding p-values) may, therefore, be an alternative, particularly because the effort to compute non-asymptotic unconditional p-values for such more complex situations does not increase as compared to the classical situation. For ‘classical’ null hypotheses it is known that the number of possible p-values derived by the unconditional method is very large, albeit finite, and the same is true for the null hypotheses studied in this paper. In most of the situations investigated it becomes obvious that Barnard's CSM test (1947) when adapted to the respective null space is again a very powerful test. A theorem is provided which in addition to allowing fast algorithms to compute unconditional non-asymptotical p-values fills a methodological gap in the calculation of exact unconditional p-values as it is implemented, for example, in StatXact 3 for Windows (1995).
  • Silva Mato A, Martín Andrés A (1997). Simplifying the calculation of the P-value for Barnard's test and its derivatives. Statistics and Computing, 7(2): 137-143. Abstract.
Abstract: Unconditional non-asymptotic methods for comparing two independent binomial proportions have the drawback that they take a rather long time to compute. This problem is especially acute in the most powerful version of the method (Barnard, 1947). Thus, despite being the version which originated the method, it has hardly ever been used. This paper presents various properties which allow the computation time to be drastically reduced, thus enabling one to use not only the more traditional and simple versions given by McDonald et al. (1977) and Garside and Mack (1967), but also the more complex original version of Barnard (1947).
"The objective is to test H0 : p1 = p2 (= p), for which there are two competing methodologies: the conditional one (Fisher, 1935) and the unconditional one (Barnard, 1947). The argument about which methodology is more appropriate is nearly as old as modern statistics, and we will not deal with it here; the interested reader is referred to the reviews by Martín Andrés (1991), Richardson (1994) and Sahai and Khurshid (1995). For our purposes, it suffices to point out that the unconditional method is usually defended (see, for example, McDonald et al., 1977; Liddell, 1976; Haber, 1987) on the grounds that it is more powerful than the conditional method. However, it is computationally much more complex to implement." (p. 137)
Fisher RA (1935). The logic of inductive inference. Journal of the Royal Statistical Society A, 98: 39-54. Abstract. PDF Download. Letters between RA Fisher & GA Barnard (1945-1962): PDF Download.
Fisher R (1955). Statistical Methods and Scientific Induction. Journal of the Royal Statistical Society. Series B, 17(1): 69-78. Abstract. PDF Download.
Garside GR, Mack C (1967). Correct confidence limits for the 2 x 2 homogeneity contingency table with small frequencies. New Journal of Statistics and Operations Research, 3(2): 1-25.
Haber M (1987). A comparison of some conditional and unconditional exact tests for 2 x 2 contingency tables. Communications in Statistics - Simulation and Computing, 16(4): 999-1013. Abstract.
Liddell D. (1976). Practical test of 2 x 2 tables. Statistician, 25(4): 295-304.
Abstract.
McDonald LL, Davis BM, Milliken GA (1977). A non-randomized unconditional test for comparing two proportions in a 2 x 2 contingency table. Technometrics, 19: 145-150. Abstract.
Richardson JTE (1994). The analysis of 2 x 1 and 2 x 2 contingency tables: an historical review. Statistical Methods in Medical Research, 3: 107-133. Abstract.
Sahai H, Khurshid A (1995). On analysis of epidemiological data involving a 2 x 2 contingency table: an overview of Fisher's exact test and Yates' correction for continuity. Journal of Biopharmaceutical Statistics, 5(1): 43-70. Abstract.

  • Martín Andrés A, Silva Mato A (1994). Choosing the optimal unconditioned test for comparing two independent proportions. Computational Statistics and Data Analysis, 17(5): 555-574. Abstract.

Abstract: The most powerful non-asymptotic unconditioned method for comparing two proportions (independent samples) is that of Barnard (1945), but the complexity of computation has led to several simplifying versions being produced. There is no complete global comparison of these (though there are some partial ones, such as Haber's, 1987), nor has there been an evaluation of the loss incurred by not using Barnard's method. In this paper all existing relevant versions (including Barnard's, and one proposed by the authors) for a wide range of sample sizes are compared, as well as for one-and two-tailed tests (only the second case has been dealt with in recent literature), and a conclusion is drawn about the suitability of the new method proposed. The comparison is effected on the basis of the new criterion of “mean power”, and the other customary criteria for comparing methods, based on the comparison of their powers in each point of the parametric space, are criticized. The new criterion can be applied to all tests based on discrete random variables. Finally, given the large number of methods proposed in the relevant literature for solving this problem, the authors classify the same in function of their precision and their complexity of computation.

  • Camilli, Gregory (1990). The test of homogeneity for 2x2 contingency tables: A review of and some personal opinions on the controversy. Psychological Bulletin, 108(1): 135-145. Abstract.

Abstract: The 2×2 table has received an enormous amount of attention in the research literature. Most studies have focused on Type I error rates and the power of the chi-square statistic, but some have been more concerned with the theoretical justification behind methods of analysis. Little consensus has been achieved in either area. The reason for this is that 2 basic inferential paradigms that underlie much of the work in 2×2 tables are incompatible. Thus, empirical studies of Type I error rates of the chi-square test within the Neyman–Pearson framework are considered irrelevant by advocates of R. A. Fisher's exact test. Both approaches are described in this article. G. A. Barnard's (1947) test is shown to be theoretically superior to the chi-square test and all of its corrected cousins. However, Fisher's exact test is advocated as the most rational choice.


Papers Related to Using Mid-P values:

  • Lancaster HO (1949). The combination of probabilities arising from data in discrete distributions. Biometrika, 36: 370-382. Reference/Introduction.
  • Lancaster HO (1961). Significance tests in discrete distributions. Journal of the American Statistical Association, 56: 223–234. Abstract. PDF Download
  • Haber M (1986). A modified exact test for 2 × 2 contingency tables. Biometrical Journal, 28 (4): 455–463. Abstract: A modified exact test is proposed for 2×2 contingency tables. This test, which is based on a less conservative definition of the concept of significance (Stone, 1969) is compared with a modified form of Pearson's X2 test and with Tocher's randomized exact (UMPU) test. The sizes of the new test lie near the nominal 0.05 levels while those of the X2 test usually exceed the nominal level, sometimes by a factor of 2 or more. The power of the modified test is usually close to that of the UMPU test.
  • Barnard, George A (1989). On alleged gains in power from lower p-values. Statistics in Medicine, 8: 1469-1477. Abstract: The various suggestions that have been made to increase the power of Fisher's test for 2 x 2 tables are shown to give no real increases. One-sided tests are examined in detail but two-sided test problems are also considered. The need for flexibility in use of P-values is stressed.
  • Barnard, George A (1990). Must clinical trials be large? The interpretation of p-values and the combination of test results. Statistics in Medicine, 9: 601-614. Abstract: The notion that small, well planned clinical trials may not be worth undertaking is shown to arise from an overemphasis on just one way of interpreting P-values. Alternative forms of P and other interpretations are put forward. Attention is drawn to some aspects of the theory of hypothesis testing which seem less well known than they should be.
  • Hirji KF, Tau S-J, Elashoff RM (1991). A Quasi-Exact Test for Comparing Two Binomial Proportions. Statistics in Medicine, 10: 1137-1153.  - Read Online & Download Page. PDF Download. Abstract: The use of the Fisher exact test for comparing two independent binomial proportions has spawned an extensive controversy in the statistical literature. Many critics have faulted this test for being highly conservative. Partly in response to such criticism, some statisticians have suggested the use of a modified, non-randomized version of this test, namely the mid-P-value test. This paper examines the actual type I error rates of this test. For both one-sided and two-sided tests, and for a wide range of sample sizes, we show that the actual levels of significance of the mid-P-test tend to be closer to the nominal level as compared with various classical tests. The computational effort required for the mid-P-test is no more than that needed for the Fisher exact test. Further, the basis for its modification is a natural adjustment for discreteness; thus the test easily generalizes to r x c contingency tables and other discrete data problems.
  • Upton, Graham JG (1992). Fisher's Exact Test. Journal of the Royal Statistical Society. Series A (Statistics in Society), 155(3): 395-402. Abstract: This paper reviews the problems that bedevil the selection of an appropriate test for the analysis of a 2 x 2 table. In contradiction to an earlier paper, the author now argues the case for the use of Fisher's exact test. It is noted that all test statistics for the 2 x 2 table have discrete distributions and it is suggested that it is irrational to prescribe an unattainable fixed significance level. The use of mid-P is suggested, if a formula is required for prescribing a variable tail probability. The problems of two-tail tests are discussed.
Excerpt: However, when X is discrete, E{T(X)} > 0.5, implying that the Fisher tail areas 'are "biased" in an upward direction' (Barnard, 1989). To correct this problem Lancaster (1949) suggested the use of the mid-P-value M(x), given by M(x) = P(X>x) + 0.5P(X=x). (1) Further support for the use of mid-P is to be found in Stone (1969) and Anscombe (1981). (p. 399)
  • Routledge RD (1994). Practicing Safe Statistics with the Mid-p. The Canadian Journal of Statistics / La Revue Canadienne de Statistique, 22(1): 103-110. Abstract: The mid-p-value is the standard p-value for a test minus half the difference between it and the nearest lower possible value. Its smaller size lends it an obvious appeal to users - it provides a more significant-looking summary of the evidence against the null hypothesis. This paper examines the possibility that the user might overstate the significance of the evidence by using the smaller mid-p in place of the standard p -value. Routine use of the mid-p is shown to control a quantity related to the Type I error rate. This related quantity is appropriate to consider when the decision to accept or reject the null hypothesis is not always firm. The natural, subjective interpretation of a p-value as the probability that the null hypothesis is true is also examined. The usual asymptotic correspondence between these two probabilities for one-sided hypotheses is shown to be strengthened when the standard p-value is replaced by the mid-p.
  • Berry G, Armitage P (1995). Mid-P confidence intervals: a brief review. The Statistician, 44(4): 417-423. PDF Download. Abstract: Significance tests that are based on discrete probabilities are conservative in that the average value of the significance level, when the null hypothesis is true, always exceeds 0.5. An approach suggested by H. O. Lancaster over 40 years ago overcomes this problem. This is to calculate the mid-P value, where only half of the probability of the observed sample is included in the tail. The average value of the mid-P value is 0.5 and the variance is slightly less than that of a random variable uniformly distributed between 0 and 1. The mid-P concept has usually been advocated in the context of significance testing but it can be extended to the calculation of confidence intervals in an estimation approach by defining, for example, the 95% mid-P confidence limits as the values that have a one-sided mid-P value of 0.025. In this paper we review recent work supporting this approach.
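The definition reviewed above (confidence limits whose one-sided mid-P value is 0.025) can be inverted numerically. A sketch for a single binomial proportion, using root-finding over each limit (the function name is ours; assumes SciPy):

```python
from scipy.optimize import brentq
from scipy.stats import binom

def midp_ci(k, n, conf=0.95):
    """Mid-P confidence interval for a binomial proportion, following
    Berry & Armitage: each limit is the value of p whose one-sided
    mid-P value equals (1 - conf) / 2."""
    tail = (1 - conf) / 2

    def lower_eq(p):   # mid-P for observing k or more successes
        return binom.sf(k - 1, n, p) - 0.5 * binom.pmf(k, n, p) - tail

    def upper_eq(p):   # mid-P for observing k or fewer successes
        return binom.cdf(k, n, p) - 0.5 * binom.pmf(k, n, p) - tail

    lo = 0.0 if k == 0 else brentq(lower_eq, 1e-12, 1 - 1e-12)
    hi = 1.0 if k == n else brentq(upper_eq, 1e-12, 1 - 1e-12)
    return lo, hi
```

By the symmetry of the binomial, the interval for 5 successes in 10 trials is symmetric about 0.5, which makes a convenient sanity check.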
  • Agresti, Alan (2001). Exact inference for categorical data: recent advances and continuing controversies. Statistics in Medicine, 20(17-18): 2709-2722. PDF Download. Abstract: Methods for exact small-sample analyses with categorical data have been increasingly well developed in recent years. A variety of exact methods exist, primarily using the approach that eliminates unknown parameters by conditioning on their sufficient statistics. In addition, a variety of algorithms now exist for implementing the methods. This paper briefly summarizes the exact approaches and describes recent developments. Controversy continues about the appropriateness of some exact methods, primarily relating to their conservative nature because of discreteness. This issue is examined for two simple problems in which discreteness can be severe--interval estimation of a proportion and the odds ratio. In general, adjusted exact methods based on the mid-P-value seem a reasonable way of reducing the severity of this problem.
  • Hwang JTG, Yang M-C (2001). An optimality theory for mid P-values in 2x2 contingency tables. Statistica Sinica, 11: 807-826. PDF Download. Abstract: The contingency table arises in nearly every application of statistics. However, even the basic problem of testing independence is not totally resolved. More than thirty-five years ago, Lancaster (1961) proposed using the mid p-value for testing independence in a contingency table. The mid p-value is defined as half the conditional probability of the observed statistic plus the conditional probability of more extreme values, given the marginal totals. Recently there seems to be recognition that the mid p-value is quite an attractive procedure. It tends to be less conservative than the p-value derived from Fisher's exact test. However, the procedure is considered to be somewhat ad-hoc. In this paper we provide theory to justify mid p-values. We apply the Neyman-Pearson fundamental lemma and the estimated truth approach, to derive optimal procedures, named expected p-values. The estimated truth approach views p-values as estimators of the truth function which is one or zero depending on whether the null hypothesis holds or not. A decision theory approach is taken to compare the p-values using risk functions. In the one-sided case, the expected p-value is exactly the mid p-value. For the two-sided case, the expected p-value is a new procedure that can be constructed numerically. In a contingency table of two independent binomial samplings with balanced sample sizes, the expected p-value reduces to a two-sided mid p-value. Further, numerical evidence shows that the expected p-values lead to tests which have type one error very close to the nominal level. Our theory provides strong support for mid p-values.
  • Agresti A, Gottard A (2007). Nonconservative exact small-sample inference for discrete data. Computational Statistics & Data Analysis, 51(12): 6447-6458. Abstract: Exact small-sample methods for discrete data use probability distributions that do not depend on unknown parameters. However, they are conservative inferentially: the actual error probabilities for tests and confidence intervals are bounded above by the nominal level. This article surveys ways of reducing or even eliminating the conservatism. Fuzzy inference is a recent innovation that enables one to achieve the error probability exactly. We present a simple way of conducting fuzzy inference for discrete one-parameter exponential family distributions. In practice, most scientists would find this approach unsuitable yet might be disappointed by the conservatism of ordinary exact methods. Thus, we recommend using exact small-sample distributions but with inferences based on the mid-P value. This approach can be motivated by fuzzy inference, it is less conservative than standard exact methods, yet usually it does well in terms of achieving desired error probabilities. We illustrate for inferences about the binomial parameter.
  • Parzen, Emanuel (2009). United Applicable Statistics: Mid-Distribution, Mid-Quantile, Mid P Confidence Intervals Proportion p. Keynote Paper. Advances in Statistics and Applied Probability: Unified Approaches. A Symposium in Honor of Benjamin N. Kedem. PDF Download. Abstract: We review the seminal influence of Ben Kedem on statistical time series analysis. We motivate our research on United Applicable Statistics ("analogies between analogies") approach to a learning framework for almost all of the Science of Statistics, which we distinguish from the Statistics of Science. We describe the exciting probability and statistical theory of mid-distributions, mid-quantiles, new way to calculate (for data with ties) sample quantiles and median (mid), asymptotic normality of mid-distributions of Binomial, Poisson, hypergeometric distributions. We advocate statistical inference by mid-P VALUE function of a parameter whose inverse (under a stochastic order condition) is defined to be confidence quantile (of a confidence distribution). We show mid-P frequentist confidence intervals for discrete data have endpoint function equal to confidence quantile, which is algorithmically analogous to Bayesian posterior quantile. One computes frequentist (without assuming prior) but interprets Bayesian. We conclude with 0-1 data and quasi-exact (Beta distribution based) confidence quantiles of parameters p and log-odds(p). We claim quasi-identity of frequentist mid-P confidence intervals and Bayesian posterior credible intervals with uninformative Jeffreys prior. For parameters of standard probability models, calculating confidence quantiles yields Bayesian posterior quantiles for non-informative conjugate priors.
"A problem often encountered in the practice of statistics is not that we don’t have an answer to your question, but that we have too many answers and don’t know which ones to choose as our “final” answer. A problem with an extensive literature, and many competing answers, is inference for parameters of discrete data, such as the true population proportion p when one observes K successes in n trials. What may be novel is our claim that for a proportion p of 0 − 1 data the mid-P frequentist confidence interval is approximately identical with the Bayesian Jeffrey’s prior credible interval. An important inference method that is unknown (and perhaps difficult to accept) to many statisticians is the “mid-P” approach (usually credited to Lancaster, 1961). This paper presents theory to justify this frequentist approach and argues that it can be recommended as the “final” (benchmark) answer because it is identical with the Bayesian answer for a Jeffrey’s Beta(.5,.5) prior for p. While for large samples other popular answers are approximately numerically equivalent, introductory courses will be happier if we teach only one way, the “right” way, the way that is accurate for small samples and zero successes. It is easy to compute from software for the quantile function of the Beta distribution."
  • Fellows, Ian (2010). The Minimaxity of the Mid P-value under Linear and Squared Loss Functions. Communications in Statistics - Theory and Methods, 40(2): 244-254. Abstract / First Page. - Discussion/Conclusion: Although the mid-p was developed over 50 years ago, practitioners have yet to adopt it into everyday practice. Because the mid-p can have Type I error greater than its nominal level, it could possibly be viewed as deceptive (Routledge, 1994). This view is understandable considering the lack of solid theoretical grounding. The mid-p has, with some exceptions (Lancaster, 1961; Hwang and Yang, 2001), been justified using heuristic devices (Barnard, 1989; Berry and Armitage, 1995) and numerical simulation (Hirji, 1991). While these justifications do provide some comfort that the mid-p is safe, one would be correct to be somewhat uneasy with its wholesale application to all problems. Under the estimated truth framework, the mid-p has a solid claim to primacy. The minimax theorems developed in this article show that in the worst-case scenarios, the mid-p is the least risky p-value. These results apply to a surprisingly wide range of situations. In the case of a simple hypothesis test, the mid-p and the likelihood ratio ordering were found to be mutually supportive of one another. In the case of one-sided tests, the mid p-value was shown to dominate all other members of its family. The mid p-value has smaller maximum risk than any of the more conservative members of its family in the case of two-sided tests, and dominates all of its family in some important special cases that pervade statistical practice.
  • Biddle DA, Morris SB (2011). Using Lancaster's mid-P correction to the Fisher's exact test for adverse impact analyses. Journal of Applied Psychology, 96(5): 956-965. Abstract: Adverse impact is often assessed by evaluating whether the success rates for 2 groups on a selection procedure are significantly different. Although various statistical methods have been used to analyze adverse impact data, Fisher's exact test (FET) has been widely adopted, especially when sample sizes are small. In recent years, however, the statistical field has expressed concern regarding the default use of the FET and has proposed several alternative tests. This article reviews Lancaster's mid-P (LMP) test (Lancaster, 1961), an adjustment to the FET that tends to have increased power while maintaining a Type I error rate close to the nominal level. On the basis of Monte Carlo simulation results, the LMP test was found to outperform the FET across a wide range of conditions typical of adverse impact analyses. The LMP test was also found to provide better control over Type I errors than the large-sample Z-test when sample size was very small, but it tended to have slightly lower power than the Z-test under some conditions.
  • Crans GG, Shuster JJ (2008). How conservative is Fisher's exact test? A quantitative evaluation of the two-sample comparative binomial trial. Statistics in Medicine, 27(18): 3598-3611. Abstract. See also: Martin Andrés A, Herranz Tejedor I (2009). Comments on 'How conservative is Fisher's exact test? A quantitative evaluation of the two-sample comparative binomial trial' by G. G. Crans and J. J. Shuster, Statistics in Medicine 2008; 27:3598-3611. Statistics in Medicine, 28(1): 173-4. Reference.


Roger Berger's Online P-Value Calculator: To Calculate Fisher's Exact Test, the Fisher-Boschloo Test of Boschloo (1970), & the z-Pooled or z-Unpooled P-Values from Suissa & Shuster (1985):


Berger, Roger L (1996-2005). Exact Unconditional Homogeneity/Independence Tests for 2X2 Tables. http://www.stat.ncsu.edu/exact/ - "This performs exact, unconditional tests of homogeneity (binomial model) or independence (multinomial model) for 2x2 tables. These tests are usually uniformly more powerful than Fisher's exact test."
  • Berger, Roger L (1996). More powerful tests from confidence interval p values. American Statistician, 50: 314-318. [Download paper]
Abstract: In this article, the problem of comparing two independent binomial populations is considered. It is shown that the test based on the confidence interval p value of Berger and Boos (1994) often is uniformly more powerful than the standard unconditional test. This test also requires less computational time. - Introduction:  The problem of comparing two binomial proportions has been considered for many years. The most commonly used test is Fisher's Exact Test (Fisher, 1935), a conditional test. Barnard (1945, 1947) proposed an unconditional test for this problem. Although unconditional tests are usually more powerful than conditional tests, they are computationally much more complex. But recent advances in computing have made unconditional tests practical, and they are beginning to appear in statistical software packages such as StatXact 3 for Windows. In this article it is shown that unconditional tests based on the confidence interval p value of Berger and Boos (1994) are often uniformly more powerful than the standard unconditional tests.
  • Berger, Roger L (1994). Power comparison of exact unconditional tests for comparing two binomial proportions. Institute of Statistics Mimeo Series No. 2266. [Download paper]
  • Berger RL, Boos DD (1994). P values maximized over a confidence set for the nuisance parameter. Journal of the American Statistical Association, 89: 1012-1016. [Download paper]
  • Röhmel J (2005). Problems with existing procedures to calculate exact unconditional P-values for non-inferiority/superiority and confidence intervals for two binomials and how to resolve them. Biometrical Journal, 47(1): 37-47. Discussion 99-107. Abstract: Recently several papers have been published that deal with the construction of exact unconditional tests for non-inferiority and confidence intervals based on the approximative unconditional restricted maximum likelihood test for two binomial random variables. Soon after the papers were published, the commercially available software for exact tests, StatXact, incorporated the new methods. There are, however, gaps in the proofs which have since not been resolved adequately. Further, it turned out that the methods for testing non-inferiority are not coherent, and a test for non-inferiority can easily come to different conclusions compared to the confidence interval inclusion rule. In this paper, a proposal is made how to resolve the open problems. Berger and Boos (1994) developed the confidence interval method for testing equality of two proportions. StatXact (Version 5) has extended this method for shifted hypotheses. It is shown that at least for unbalanced designs (i.e. largely different sample sizes) the Berger and Boos method can lead to controversial results.
"In Chapter 14.3.10 of the User Manual for StatXact Version 5.0.3, the Berger and Boos Correction is used for computing an unconditional exact confidence interval for the difference of two binomial parameters based on inverting two 1-sided hypothesis tests. StatXact is noting that Berger and Boos (1994) actually proposed their method only for hypothesis tests and that the application to the calculation of confidence intervals is new. But having shown that the p-values derived with the Berger-Boos correction need not satisfy Barnard’s convexity condition there is no guarantee that searching only on the boundary in non-classical (i.e. shifted) situations will successfully determine the suprema that are necessary for a correct calculation." (p. 45)
  • Boschloo R. D. (1970). Raised conditional level of significance for the 2X2 table when testing the equality of two probabilities. Statistica Neerlandica, 24(1): 1-35. PDF Download. Download Page.
Summary: In this paper it is to be shown that Fisher's non-randomizing exact test for 2x2-tables, which is a conditional test, can by simple means be changed into an unconditional test using raised levels of significance; not seldom, especially for not too large samples, the level of significance can be doubled. This leads in many cases to a considerable increase of power of the test. A table with raised levels has been prepared up to sample sizes of 50 and a rule of thumb, which can be used if this table is not available, has been developed.
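Boschloo's raised-level procedure is now available in SciPy (version 1.7 or later) as `scipy.stats.boschloo_exact`, so the power gain over Fisher's conditional test is easy to see on a small table (the counts below are invented purely for illustration):

```python
import numpy as np
from scipy.stats import boschloo_exact, fisher_exact

# Hypothetical 2x2 table: [successes, failures] in each of two groups.
table = np.array([[7, 3], [2, 8]])

_, p_fisher = fisher_exact(table, alternative="greater")
res = boschloo_exact(table, alternative="greater")

# The unconditional (raised-level) p-value is never larger than the
# conditional Fisher p-value for the same one-sided alternative.
print(f"Fisher: {p_fisher:.4f}  Boschloo: {res.pvalue:.4f}")
```

This reproduces numerically the point of Boschloo's summary: treating Fisher's p-value as a test statistic and maximizing over the nuisance parameter yields a smaller, less conservative p-value.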
  • Exact unconditional p-values for 2X2 tables: This document (based on a suggestion by Tomonori Ishikawa) was abstracted from comments in Professor Roger Berger’s Fortran program XUN2X2 and is used to explain the options available in running the program from his website [http://www.stat.ncsu.edu/exact/]. - Program Overview: This program computes exact unconditional p-values for analyzing 2X2 tables. The program gives the user the choice of using either a binomial or multinomial model, either a one- or two-sided test, and the choice of three test statistics. It is unusual in considering multinomial models and a variety of test statistics...
  • Lin C-Y, Yang M-C (2009). Improved p-Value Tests for Comparing Two Independent Binomial Proportions. Communications in Statistics: Simulation and Computation, 38(1): 78-91. PDF Download. Abstract: In the comparison of two independent binomial proportions for small or moderate sample sizes, Fisher's exact test can be conservative in the sense of its actual significance level (or size) being much less than the nominal level. Boschloo (1970) proposed a procedure, which raises the significance level used for the p-value test derived from Fisher's exact test, such that the size of the p-value test is closer to and below the nominal level. Röhmel and Mansmann (1999) used the unconditional approach, called the supremum procedure, to a valid p-value to obtain a modified p-value test, and showed that the modified p-value test is always better than the original one. In this article, we show that Röhmel and Mansmann's procedure is equivalent to Boschloo's procedure. The supremum procedure will then be applied to some well-known p-value procedures appeared in the literature. Numerical studies including two real cases are given to illustrate the advantages of the proposed modified p-value tests. Numerical results also show that the sizes of modified p-value tests are closer to nominal levels than those of the original p-value tests for many cases, especially in case of unbalanced sample sizes.
  • Lydersen S, Langaas M, Bakke O (2011): The exact unconditional z-pooled test for equality of two binomial probabilities: optimal choice of the Berger and Boos confidence coefficient. Journal of Statistical Computation and Simulation, DOI:10.1080/00949655.2011.579969. Online First. Abstract.
Abstract: Exact unconditional tests for comparing two binomial probabilities are generally more powerful than conditional tests like Fisher's exact test. Their power can be further increased by the Berger and Boos confidence interval method, where a p-value is found by restricting the common binomial probability under H0 to a 1 - γ confidence interval. We studied the average test power for the exact unconditional z-pooled test for a wide range of cases with balanced and unbalanced sample sizes, and significance levels 0.05 and 0.01. The detailed results are available online on the web. Among the values 10^-3, 10^-4, ..., 10^-10, the value γ = 10^-4 gave the highest power, or close to the highest power, in all the cases we looked at, and can be given as a general recommendation as an optimal γ.
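The procedure described in the abstract can be sketched end to end: compute the z-pooled statistic of Suissa & Shuster, maximize the tail probability over the common success probability, and apply the Berger-Boos refinement by restricting that search to a 1 - γ Clopper-Pearson interval for the pooled proportion and adding γ back to the supremum. This is a plain grid-search illustration under those assumptions, not a production implementation (function names are ours; assumes NumPy and SciPy):

```python
import numpy as np
from scipy.stats import beta, binom

def z_pooled(x1, n1, x2, n2):
    """Pooled-variance z statistic (Suissa & Shuster, 1985)."""
    pooled = (x1 + x2) / (n1 + n2)
    se = np.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return 0.0 if se == 0 else (x1 / n1 - x2 / n2) / se

def z_pooled_uncond_p(x1, n1, x2, n2, gamma=1e-4, grid=201):
    """One-sided (H1: p1 > p2) exact unconditional p-value for the
    z-pooled test with the Berger-Boos restriction: the supremum is
    taken over a 1 - gamma Clopper-Pearson interval for the common p,
    and gamma is added back to keep the test valid
    (gamma = 1e-4 per Lydersen et al.)."""
    t_obs = z_pooled(x1, n1, x2, n2)
    # Tables at least as extreme as the observed one.
    extreme = [(y1, y2) for y1 in range(n1 + 1) for y2 in range(n2 + 1)
               if z_pooled(y1, n1, y2, n2) >= t_obs - 1e-12]
    # Clopper-Pearson 1 - gamma interval for the pooled proportion.
    k, n = x1 + x2, n1 + n2
    lo = 0.0 if k == 0 else beta.ppf(gamma / 2, k, n - k + 1)
    hi = 1.0 if k == n else beta.ppf(1 - gamma / 2, k + 1, n - k)
    sup = max(sum(binom.pmf(y1, n1, p) * binom.pmf(y2, n2, p)
                  for y1, y2 in extreme)
              for p in np.linspace(lo, hi, grid))
    return min(1.0, sup + gamma)
```

The grid search over the nuisance parameter is the conceptually simplest approach; real implementations (StatXact, Berger's XUN2X2) use more careful maximization, as Röhmel's cautions above about searching only part of the parameter space make clear.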

Online Calculators to Generate P-Values & Mid-P Values for Unconditional Null Hypothesis Tests:
  • Roger Berger Calculator (1996, Described Above): http://www.stat.ncsu.edu/exact/. Will Calculate Fisher Exact Tests, Fisher-Boschloo Tests, Fisher Mid-p Values, & z-pooled & z-unpooled p-Values.
  • SMP Program, Version 2.1 by A. Silva Mato & A. Martín Andrés (1999): http://www.ugr.es/~bioest/software.htm. Related Papers: http://www.ugr.es/~bioest/smp.htm. To Calculate the Barnard Test for Two Independent Binomials (unconditional real p-value & unconditional estimated p-value), Fisher Unconditional Tests, Fisher Mid-p Value, z-pooled p-Values, & Anscombe's arcsine p-value statistics. According to Martín Andrés (personal communication, January 2012), there is no other software solution to run the original Barnard test. At the site, the TMP Program is available to produce Barnard Tests & Other Tests for Multinomials.
  • WinPepi (http://www.brixtonhealth.com/): Will Calculate Fisher Exact Tests, Fisher Mid-P, Upton "N - 1" Chi Squared, Log-Likelihood Chi Squared, Cohen's Effect Size, & More.
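Among the statistics WinPepi reports, Upton's "N - 1" chi-squared is simply the Pearson statistic (without continuity correction) rescaled by (N - 1)/N, so it is easy to reproduce (the function name is ours; assumes SciPy):

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

def upton_n_minus_1(table):
    """Upton's "N - 1" chi-squared test for a 2x2 table: the Pearson
    statistic (no continuity correction) multiplied by (N - 1) / N,
    referred to the chi-squared distribution with 1 df."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    stat, _, _, _ = chi2_contingency(table, correction=False)
    stat_adj = stat * (n - 1) / n
    return stat_adj, chi2.sf(stat_adj, df=1)
```

The adjusted statistic is always slightly smaller than the plain Pearson statistic, so the test is marginally less anti-conservative in small samples.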

Part 2: "An Expanded Homosexuality Factor in Adolescent Suicide"
Adding Those Harassed Because They Were Assumed to be Homosexual
To the "At Risk" Homosexual Adolescents Who Died by Suicide in
The Shaffer et al. (1995) & Renaud et al. (2010) Studies

See: Tabulated Information



The objective of this section is to show that the category of adolescents at risk for greater suicidality can be expanded beyond the gay / lesbian / bisexual category by including those who experienced harassment or abuse based on others' assumption that they were gay or lesbian (an ascribed identity; they may identify as heterosexual), thereby making the homosexuality-related "At Risk" category more inclusive. Elsewhere, in relation to data from the 1995 Seattle Youth Risk Behavior Survey, "The Total Homosexuality Factor in Adolescent Suicidality" was described as consisting of all adolescents identifying as gay, lesbian, bisexual, or unsure of their sexual orientation - with varying proportions of each group reporting having been targeted for anti-gay harassment - plus all heterosexual identified adolescents targeted for anti-gay harassment and those engaging in same-sex sex. In the 1995 Seattle survey, for example, males who were unsure of their sexual orientation or who identified as heterosexual and were targeted for anti-gay harassment were at higher risk for a suicide attempt than their counterparts not so targeted. Of all groups, the increase in risk for a suicide attempt requiring medical attention was greatest for heterosexual identified males targeted for anti-gay harassment, compared to heterosexual identified males not so targeted (8 / 122 (6.6%) vs. 45 / 3197 (1.4%), Odds Ratio: 2.1 < 5.0 < 11.3, p = 0.001). For the analyses below, the "At Risk" category is limited to those adolescents deemed homosexually oriented by the people interviewed in the psychological autopsy studies, plus adolescents known by the interviewees to have been harassed because they were assumed to be gay or lesbian. Not included, for example, would be heterosexual identified males who were not anti-gay harassed but had been sexually active with other males.
For this reason, the analysis is titled "An Expanded Homosexuality Factor in Adolescent Suicide" rather than "The [Total] Homosexuality Factor in Adolescent Suicide." Below are examples of the reported highly negative effects on adolescents, many of them heterosexual identified, who were targeted for anti-gay harassment.
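As a numerical check on the Seattle comparison quoted above (8 / 122 vs. 45 / 3197), the odds ratio can be recomputed from the raw counts. A Woolf log-normal interval is used here; since the published estimate (2.1 < 5.0 < 11.3) was presumably computed from weighted survey data or with a different interval method, the unweighted figures below reproduce it only approximately:

```python
import math

# Suicide attempts requiring medical attention (1995 Seattle YRBS):
# heterosexual identified males targeted for anti-gay harassment
# versus those not so targeted.
a, n1 = 8, 122       # harassed: attempts / group size
c, n2 = 45, 3197     # not harassed: attempts / group size
b, d = n1 - a, n2 - c

odds_ratio = (a * d) / (b * c)            # about 4.9

# Woolf 95% confidence interval on the log-odds-ratio scale.
se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
lo = math.exp(math.log(odds_ratio) - 1.96 * se)
hi = math.exp(math.log(odds_ratio) + 1.96 * se)
print(f"OR = {odds_ratio:.1f}, 95% CI {lo:.1f}-{hi:.1f}")
```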

Being Known or Assumed to be "Gay" or "Lesbian": Examples of Negative Effects on Heterosexual and Sexual Minority Adolescents
Jackson PS, Peterson J (2004). Depressive disorder in highly gifted adolescents. The Journal of Secondary Gifted Education, 14(3): 175–186. Full Text.
"Another highly gifted boy, 13, described the onset of a major depressive episode following the death of his pet canary:

I felt like something irreparable happened when my bird Merlin died. I was so sick, I mean physically sick, day after day. I had the flu constantly. I felt that I had the flu; when you think about that, it is “funny,” ironic. I had the flu and Merlin “flew” away. I mean, the stress of that made me sick. I made myself sick with that — I know that I did.

When it gets really bad, when the kids at school are calling me gay — can you tell me why they call me gay? I try to cope, I try to say “no troubles, just one more year here.” But, then something happens, and it happens very, very quickly.

I leave my body. I leave my mind. . . . I don’t know if I am here. And then I know that I am dying. I mean my actual brain is here, but my soul has left my body. I am screaming inside: “It is all so stupid, stupid, stupid!” I am dying here.

This young man’s physical manifestation of distress became so severe that he could no longer leave the house. He lost his appetite and became emaciated. However, a comprehensive series of physical exams revealed no irregularities:

These tests - I feel badly about the cost and bother to the medical system. I mean, there is nothing wrong with me; nothing physically wrong with me. It is my soul that is dying here. I keep telling them that is what is wrong, and they keep searching for clues in my body. That is not what the problem is."


Wodnik B, North S (1997). Not race, but who she loved, brought abuse on Everett teen. Full Text.
She's also a lesbian. And because of that she was attacked and her ankle broken on the way home from Everett High School. She was spit on and called "dyke." She had full cans of pop thrown at her head and once received a note signed by about a dozen fellow students saying they'd pay a million dollars to see her burned at the stake in the high school auditorium. Sometimes she lay in her bed at night and prayed to God to end her life.

Donaldson James, Susan (2009). When Words Can Kill: 'That's So Gay'. ABC News. Full Text.
Carl Joseph Walker-Hoover was 11-- hardly old enough to know his sexuality and yet distraught enough to hang himself last week after school bullies repeatedly called him "gay."
Hanlon, Joleen (2009). A Tragic Lesson in Anti-Gay Bullying. Education Week. Full Text.
This 11-year-old [Carl Joseph Walker-Hoover] sent the world a powerful message, one demonstrating just how painful words can be. His death also provided educators with another illustration of the need to address homophobic attitudes in schools. The consequences of anti-gay bullying may be difficult at times to see, but they can forever alter, and sometimes end, the life of a child. It is time for educators to stop overlooking anti-gay language and start responding to it with the same vigor we would to the expression of racist attitudes.
The fact is that this type of hate language, used against lesbian, gay, bisexual, and transgender, or LGBT, youths, is common in American schools. Students who are LGBT - or are perceived to be - are frequently bullied. In fact, sexual orientation is, according to a 2005 nationwide survey, the second most common reason for repeated harassment in schools. "Words such as 'gay,' 'fag,' and 'queer' are often used as the most hurtful insults students can throw at one another."

Badash, David (2010). Bullied 14-Year Old Called “Gay, Girly, Fag” Commits Suicide. News: The New Civil Rights Movement.
Full Text.

Fitz, Timothy (2010). Billy Lucas: Teen Commits Suicide After Being Called Gay and Told to Kill Himself. Chicago News Report.
Full Text.

Eckholm, Erik (2011). Eight Suicides in Two Years at Anoka-Hennepin School District. The New York Times.
Full Text.
Brittany Geldert, 14, another plaintiff, has called herself bisexual since seventh grade and said she had repeatedly been called “dyke” while teachers looked the other way. Her grades plummeted, her poetry took a dark turn and she has been hospitalized for severe depression and suicidal thoughts.

Melloy, Kilian (2011). Was Straight Teen’s Death a Result of Homophobic Bullying? Edge, Boston. Full Text.
An Ohio teenager died from an apparently self-inflicted gunshot wound after suffering homophobic bullying at school, reported Akron newspaper the Beacon Journal in a story that Queerty.com picked up. The story related how Nicholas Kelo, Jr., 13, of Rittman, Ohio, suffered at school after fellow students assumed that he was gay for setting aside football and taking up band once he reached high school.

----- (2012). Zachery Gray, [heterosexual identified] Florida Teen, Suffers Brain Damage After Suicide Attempt Allegedly Due To Anti-Gay Bullying. Huffington Post.
Full Text.


Long Island Gay, Lesbian, Bisexual and Transgender Network (2012). Current News. Full Text.
"It is with sad news to report that David Hernandez, a 16-year-old East Hampton High School student, committed suicide over a week ago. Reports have surfaced that this suicide was allegedly tied to David being bullied and harassed because of the perception that he was gay. This is a difficult time for David’s family, classmates, friends, and the entire East Hampton community. Gay teen suicide is an epidemic and this most recent suicide hits close to home as it has happened right in our own backyard. We have been working closely with East Hampton school administration and officials since this tragedy to ensure that support is available for the healing that needs to take place, and to create a plan to move forward in ensuring no other teen, GLBT or heterosexual, feels the need to take their own life."

[Important Note: There may be an ignored epidemic of suicidality among heterosexual identified teens associated with anti-gay bullying - that is, being bullied/abused because they are assumed to be gay or lesbian. In the 1995 Seattle Youth Risk Behavior Survey, of the males reporting being targeted for anti-gay harassment and having attempted suicide, 7 were gay/bisexual identified, 7 were unsure of their sexual orientation, and 20 (the majority) were heterosexual identified. For males reporting being targeted for anti-gay harassment and having attempted suicide that required medical attention, 4 were gay/bisexual identified, 1 was unsure of his sexual orientation, and 8 (the majority) were heterosexual identified. Of the females reporting being targeted for anti-lesbian harassment and having attempted suicide, 16 were lesbian/bisexual identified, 10 were unsure of their sexual orientation, and 66 (the majority) were heterosexual identified. For females reporting being targeted for anti-lesbian harassment and having attempted suicide that required medical attention, 6 were lesbian/bisexual identified, 4 were unsure of their sexual orientation, and 27 (the majority) were heterosexual identified. To date, heterosexual identified youth have been largely ignored in studies reporting on anti-gay/lesbian harassment, victimization, and violence.]

The American Academy of Pediatrics (2012). Cyberbullying Only Rarely the Sole Factor Identified in Teen Suicides. Full Text.

Most teen suicide victims are bullied both online and in school, and many suicide victims also suffer from depression... researchers searched the Internet for reports of youth suicides where cyberbullying was a reported factor. Information about demographics and the event itself were then collected through searches of online news media and social networks... The study identified 41 suicide cases (24 female, 17 male, ages 13 to 18) from the U.S., Canada, the United Kingdom and Australia. In the study, 24 percent of teens were the victims of homophobic bullying, including the 12 percent of teens identified as homosexual and another 12 percent of teens who were identified as heterosexual or of unknown sexual preference.

Robinson JP, Espelage DL (2012). Bullying Explains Only Part of LGBTQ–Heterosexual Risk Disparities: Implications for Policy and Practice. Educational Researcher, 41(8): 309–319. Abstract. Abstract Excerpt: Our sample consisted of 11,337 students in Grades 7 through 12 from 30 schools in Dane County, Wisconsin. Using both multilevel covariate-adjusted models and propensity-score-matching models, we found that although victimization does explain a portion of the LGBTQ–heterosexual risk disparities, substantial differences persist even when the differences in victimization are taken into account. For example, LGBTQ-identified students were 3.3 times as likely to think about suicide (p < .0001), 3.0 times as likely to attempt suicide (p = .007), and 1.4 times as likely to skip school (p = .047) as propensity-score-matched heterosexual-identified students within the same school who reported equivalent levels of peer victimization.


Takeuchi, Craig (2012). Canadian-led $2-million study to examine homophobic bullying in schools. Vancouver Free Press.
Full Text. [Hainsworth, Jeremy (2012). UBC study to evaluate success of anti-homophobia programs. Xtra, February 21. Full Text.]
The study will not be limited to queer youth but will encompass how straight students are impacted as well. Contrary to misconceptions that only one particular demographic, namely queer youth, is affected, homophobia can affect and be used against all youth, whether straight or queer. Take the example of former North Vancouver high school student Azmi Jubran, who identifies as straight, who won a landmark case against the North Vancouver School District in 2005. He took the school district to the B.C. Human Rights tribunal for failing to do anything about the homophobic bullying he was subjected to for five years. "In any high school, there are far more heterosexual teens than lesbian, gay, bisexual, or questioning teens, and because of this, we have found half or more of those targeted for anti-gay harassment actually identify as straight," UBC School of Nursing professor and principal investigator Elizabeth Saewyc stated in a news release. "There isn't much research about them, but what there is suggests they have the same health consequences as LGBTQ youth who are bullied." 

Hill C, Kearl H (2011). Crossing The Line: Sexual Harassment at School. Washington, DC: American Association of University Women (AAUW). Full Text. Full Text.
"Boys were most likely to cite being called gay in a negative way in person as their most negative experience of sexual harassment. Girls and boys were equally likely to experience this type of sexual harassment (18 percent of students surveyed), but 21 percent of boys and only 9 percent of girls identified being called gay or lesbian as their worst experience of sexual harassment." [National Survey, Grades 7 to 12]

Both the Shaffer et al. (1995) and Renaud et al. (2010) studies asked informants (for suicide victims and living controls) if they knew about a suicide victim or control having been "teased" for being gender nonconforming; that is, a boy was 'teased' for being feminine or a girl was 'teased' for being masculine... noting here that gender nonconformity has generally been used to justify suspicions or assumptions that an adolescent (or even an adult) is gay or lesbian (see Rees-Turyn et al., 2008; Bering, 2010; Wade, 2011). In fact, numerous studies have reported that the greatest difference between gay and heterosexual identified males is that gay males are much more gender nonconforming: being more feminine or effeminate than heterosexual males, beginning in childhood. The same, but to a lesser extent, seems to apply for the gender nonconformity of lesbian identified females (Bailey & Zucker, 1995; Lippa, 2008). Childhood gender nonconformity (CGN) has been associated with lower psychological well-being for pre-adolescents (Yunger et al., 2004) and early age adolescents (Memon, 2011), and with anxiety in both gay and heterosexual men, but not in lesbian or heterosexual women (Lippa, 2008). CGN has been linked to current suicidality (in the past year) for gay, lesbian and bisexual adults, but not for their heterosexual counterparts (Plöderl & Fartacek, 2009) and, historically (men born between about 1910 and 1955), with higher levels of attempting suicide to the age of 20 years for predominantly homosexual males studied by Bell & Weinberg (1978). As reported from a data set re-analysis by Tremblay & Ramsay (2002), the incidence of attempted suicide for predominantly homosexual males with the greatest femininity in childhood, compared to those with the greatest masculinity, was 21.6% vs. 4.9%, with those reporting moderate childhood femininity having an incidence of 10.9%. A similar "attempted suicide" incidence pattern was reported by Remafedi et al. (1991) on the basis of the Bem Sex Role Inventory for 14- to 21-year-old gay and bisexual males: Feminine (49%), Androgynous & Undifferentiated (30%) and Masculine (11%).

In 2007, Ramsay & Tremblay constructed a webpage titled: "Heterosexual Identified People Reporting Some Same-Sex Attraction or Sexual Behavior - OR: Assumed to be Homosexually Oriented [Are] At Risk For Suicidal Behaviors". The webpage content expanded on the suicidality related concept - highlighted in a more general way and with a focus on multi-'racial' adolescents - by Tremblay & Ramsay (2003): "Socially Constructed Binaries & Youth Suicidality" and by Ramsay & Tremblay (2004): "Bi/Multi-Racial & Aboriginal Adolescents At High Risk For Suicidality" (Conference Index Page). The 2007 presentation focused on the association between higher suicidality and experiencing 'anti-gay/lesbian harassment' for heterosexual identified adolescents and young adults, an association that also applied for adolescents unsure of their sexual orientation. Such harassment was most likely based on the gender nonconformity of individuals (a "No-Man's Land" situation in a world of binary perceptions: being behaviorally male or female), as opposed to harassers generally knowing anything for sure about the targets' same-sex sexual desires or behaviors. The Renaud et al. (2010) results suggest the possibility that adolescent suicide - and especially adolescent male suicide - is as much linked (or even a little more linked) to males who had experienced 'anti-gay harassment' because they were noticeably gender nonconforming (as reported by informants) as to being gay or bisexual (as also reported by informants). There were 3 males deemed to be homosexually oriented, with 2 of them known to have been 'teased' for being feminine. Another 2 males were also known to have been 'teased' for being feminine. Therefore, taken separately, there were 3 males deemed to be homosexually oriented in the male suicide victim group (2 of whom were 'teased' for being feminine), but 4 males in total were reported by informants to have been 'teased' for being feminine. In the Shaffer et al. (1995) study, 3 males were deemed to be homosexually oriented in the male suicide victim group (1 of whom was 'teased' for being feminine), but 4 male suicide victims were reported by informants to have been 'teased' for being feminine. Because of low counts in both studies, differences between the two factors would not be statistically significant. Nonetheless, it is reasonable to say that the risk factor of having been teased for appearing gender nonconforming and the risk factor of belonging to a sexual minority group likely contribute to a general homosexuality related risk factor. Therefore, it is reasonable to use "An Expanded Homosexuality Factor" in the analyses presented below.
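The expanded-factor counts form simple 2x2 tables (detailed in Table AA below), and the one-sided Fisher exact p-values reported there can be reproduced with a short calculation. The following is a minimal sketch in stdlib Python (the function name is ours, not from either study): the one-sided p-value is the hypergeometric tail probability of observing at least the actual number of homosexuality-related cases in the suicide group, with the table margins held fixed.

```python
from math import comb

def fisher_one_sided(a, b, c, d):
    """One-sided (greater) Fisher exact p-value for the 2x2 table
    [[a, b], [c, d]]: probability of >= a events in the first row,
    given fixed margins (hypergeometric tail)."""
    n1, n2 = a + b, c + d      # group sizes (suicide victims, controls)
    m = a + c                  # total homosexuality-related cases
    N = n1 + n2
    return sum(comb(n1, x) * comb(n2, m - x) / comb(N, m)
               for x in range(a, min(n1, m) + 1))

# Shaffer et al. (1995): 6/120 victims vs 0/147 controls -> p ~ 0.008
# Renaud et al. (2010): 7/55 victims vs 2/55 controls -> p ~ 0.081
print(round(fisher_one_sided(6, 114, 0, 147), 3))
print(round(fisher_one_sided(7, 48, 2, 53), 3))
```

These values match the one-sided Fisher exact p-values shown for both studies in Table AA.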



Table AA: An Expanded Homosexuality Factor In Adolescent Suicide
Combining: Those Deemed to be Homosexually Oriented, Plus
Those Harassed/Abused* Because They Were Gender Nonconformable, or
Likely Suspected to be Homosexual and Treated/Abused* Accordingly

Study | Homosexuality As Suicide Risk Factor: p-Values | Study Information
Shaffer
et al.
(1995)
Fisher Test,
One Sided:
p = 0.008
Fisher-Boschloo:
p = 0.005
z-Pooled,
2 methods: p =
0.004, 0.003

Fisher Mid-P:
p = 0.004
Barnard's Test,
Real p-Value:
0.002
Estimated p-Value: 0.002
Arcsine Stat:
p = 0.002
Adolescent Males & Females (See Above Table for Study Information)
3 Males - Deemed Homosexual - out of 95 Male Suicide Victims.
0 Females - Deemed Homosexual - out of 25 Female Suicide Victims.
Total: 3 Homosexual Suicide Victims out of 120.
3 Additional Male & 0 Female Suicide Victims Known by Informants to have been 'Teased' for Gender Nonconforming Reasons.
0 Male & 0 Female Controls Known by Informants to have been 'Teased' for Gender Nonconforming Reasons.
0 Male & 0 Female Controls Deemed to be Homosexual.
Therefore: 6 Homosexuality-Related Suicide Victims.
For 2 X 2 Statistical Calculations: 6 Homosexuality-Related Suicide Victims & 114 Non-Homosexuality-Related Suicide Victims (6 / 120 = 5.0%) - vs - 0 Homosexuality-Related Controls & 147 Non-Homosexuality-Related Controls (0 / 147 = 0.0%).
Odds Ratio Calculation = Not Possible (zero cell in the control group)
Renaud
et al.
(2010)


Fisher Test,
One Sided:
p = 0.081
Fisher-Boschloo:
p = 0.054
z-Pooled,
2 methods: p =
0.048, 0.047
Fisher Mid-P:
p = 0.048
Barnard's Test,
Real p-Value:
0.044
Estimated p-Value: 0.047
Arcsine Stat:
p = 0.047
Adolescent Males & Females (See Above Table for Study Information)
3 Males - Deemed Homosexual - out of 43 Male Suicide Victims.
1 Female - Deemed Homosexual - out of 12 Female Suicide Victims.
Total: 4 Homosexual Suicide Victims out of 55.
2 Additional Male & 1 Female Suicide Victims Known by Informants to have been 'Teased' for Gender Nonconforming Reasons.
0 Male & 2 Female Controls Known by Informants to have been 'Teased' for Gender Nonconforming Reasons.
0 Male & 0 Female Controls Deemed to be Homosexual.
Therefore: 7 Homosexuality-Related Suicide Victims.
For 2 X 2 Statistical Calculations: 7 Homosexuality-Related Suicide Victims & 48 Non-Homosexuality-Related Suicide Victims (7 / 55 = 12.7%) - vs - 2 Homosexuality-Related Controls & 53 Non-Homosexuality-Related Controls (2 / 55 = 3.6%).
OR = 3.9 (95% CI: 0.86 - 17.1)
Combining the Above Shaffer et al. (1995) & Renaud et al. (2010) Study Results

Two Studies Meta-Analytically Combined: Males & Females

Results - Bayesian Analysis & Meta-Analysis - Tian (2009) Method: Table 1aa
Results - Arcsine Difference Meta-Analysis: Table 1b
Results - Continuity Correction Meta-Analysis: Table 1c
Results - TAC (Treatment Arm Correction) Meta-Analysis: Table 1d
Results - Peto Method ORs & Meta-Analysis: Table 1e

Note: All results indicate that homosexually oriented adolescents - and those harassed because they were assumed to be homosexual - are at greater risk for suicide.
Shaffer
et al.
(1995)
Fisher Test,
One Sided:
p = 0.008
Fisher-Boschloo:
p = 0.005
z-Pooled,
2 methods: p =
0.004, 0.004

Fisher Mid-P:
p = 0.004
Barnard's Test,
Real p-Value:
0.002
Estimated p-Value: 0.002
Arcsine Stat:
p = 0.002
Adolescent Males Only
(See Above Table for  Study Information)
3 Males - Deemed Homosexual -  out of 95 Male Suicide Victims.
3 Additional Males Suicide Victims Known by Informants to have been 'Teased' for Gender Nonconforming Reasons.
0 Male Controls Known by Informants to have been 'Teased' for Gender Nonconforming Reasons.
0 Male Controls Deemed  to be Homosexual.
Therefore: 6 Homosexuality-Related Suicide Victims.
For 2 X 2 Statistical Calculations: 6 Male Homosexuality-Related Suicide Victims & 89 Male Non-Homosexuality Related Suicide Victims (6 / 95 = 6.3%)  - vs - 0 Homosexuality-Related Controls & 116 Non-Homosexuality Related Controls (0 / 116 = 0.0%).
Odds Ratio Calculation = Not Possible
Renaud
et al.
(2010)
Fisher Test,
One Sided:
p = 0.028
Fisher-Boschloo:
p = 0.016
z-Pooled,
2 methods: p =
0.012, 0.012

Fisher Mid-P:
p = 0.014
Barnard's Test,
Real p-Value:
0.010
Estimated p-Value: 0.010
Arcsine Stat:
p = 0.010
Adolescent Males Only
(See Above Table for Study Information)
3 Males - Deemed Homosexual -  out of 43 Male Suicide Victims.
  2 Additional Male Suicide Victims Known by Informants to have been 'Teased' for Gender Nonconforming Reasons.
0 Male Controls Known by Informants to have been 'Teased' for Gender Nonconforming Reasons.
0 Male Controls Deemed  to be Homosexual.
Therefore: 5 Homosexuality-Related Suicide Victims.
For 2 X 2 Statistical Calculations: 5 Male Homosexuality-Related Suicide Victims & 38 Male Non-Homosexuality Related Suicide Victims (5 / 43 = 11.6%)  - vs - 0 Homosexuality-Related Controls & 43 Non-Homosexuality Related Controls (0 / 43 = 0.0%).
Odds Ratio Calculation = Not Possible
Combining the Above Shaffer et al. (1995) & Renaud et al. (2010) Study Results
Two Studies Meta-Analytically Combined: Males
Results - Bayesian Analysis & Meta-Analysis - Tian (2009) Method: Table 2aa
Results - Arcsine Difference Meta-Analysis: Table 2b
Results - Continuity Correction Meta-Analysis: Table 2c
Results - TAC (Treatment Arm Correction) Meta-Analysis: Table 2d
Results - Peto Method ORs & Meta-Analysis: Table 2e
Note: All results indicate that homosexually oriented male adolescents - and males harassed because they were assumed to be homosexual - are at greater risk for suicide.

Statistical Calculations:
Fisher Exact Test, one-sided, carried out at: http://statpages.org/ctab2x2.html and at http://www.stat.ncsu.edu/exact/, where the following tests were also generated: Fisher-Boschloo Test, one-sided (Confidence Interval method) & z-Pooled one-sided unconditional p-values, 2 methods: with/without confidence interval method (see Lydersen et al., 2011). Fisher Mid-P values calculated at: http://www.quantitativeskills.com/sisa/statistics/fisher.htm. Barnard's Tests and Arcsine Statistic generated with SMP Program, Version 2.1: http://www.ugr.es/~bioest/software.htm
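For readers who want to see what an unconditional test actually does, the one-sided z-pooled p-value can be approximated in plain Python. This is a sketch under our own naming, not the SMP or StatXact implementation: it maximizes the tail probability over a grid of the unknown common proportion (the nuisance parameter).

```python
from math import comb, sqrt

def z_pooled(x1, n1, x2, n2):
    """Pooled-variance z statistic for the difference in proportions."""
    pooled = (x1 + x2) / (n1 + n2)
    if pooled in (0.0, 1.0):
        return 0.0
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (x1 / n1 - x2 / n2) / se

def unconditional_z_pooled(x1, n1, x2, n2, grid=200):
    """One-sided unconditional p-value: maximize, over a grid of the
    nuisance proportion pi, the probability of all 2x2 tables whose
    z-pooled statistic is at least the observed one."""
    z_obs = z_pooled(x1, n1, x2, n2)
    # "Rejection region": all outcome pairs at least as extreme as observed.
    region = [(a, b) for a in range(n1 + 1) for b in range(n2 + 1)
              if z_pooled(a, n1, b, n2) >= z_obs]
    p_max = 0.0
    for g in range(1, grid):
        pi = g / grid
        pmf1 = [comb(n1, a) * pi**a * (1 - pi)**(n1 - a) for a in range(n1 + 1)]
        pmf2 = [comb(n2, b) * pi**b * (1 - pi)**(n2 - b) for b in range(n2 + 1)]
        p_max = max(p_max, sum(pmf1[a] * pmf2[b] for a, b in region))
    return p_max

# Shaffer et al. (1995) expanded factor, 6/120 vs 0/147; Table AA above
# reports one-sided z-pooled p-values of 0.004 / 0.003.
p = unconditional_z_pooled(6, 120, 0, 147)
```

Because the grid is coarse, this slightly underestimates the supremum; the programs listed above use finer optimization (and, for Fisher-Boschloo, a confidence-interval refinement of the nuisance-parameter search).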


*In both studies, the word "teased" is used for those most likely harassed or abused because adolescents were gender nonconforming (generally a proxy for being gay or lesbian): feminine or effeminate for boys and masculine for girls. In a paper titled, "Homophobic teasing, psychological outcomes, and sexual orientation among high school students: What influence do parents and schools have?", the abstract begins with a description of what the "teasing" involves: "Homophobic teasing is often long-term, systematic, and perpetrated by groups of students (Rivers, 2001); it places targets at risk for greater suicidal ideation, depression, and isolation (Elliot & Kilpatrick, 1994)" (Espelage et al., 2008). A national Canadian study exploring such 'teasing' clarified the issue via a news item titled, Gay teens 'terrorized' in Canada's schools (2009, Study). A similar concept was used to title a related American law paper: "Tormented: Anti-Gay Bullying in Schools" (Waldman, 2012). From Ireland, Harland (2009) reports study results that would generally apply for boys in the English-speaking western world: "Homophobic attitudes are common in contemporary male youth culture in Northern Ireland (Beattie, 2004), as young males believe they are expected to refute any behaviour construed as feminine or that which contravenes traditional masculine stereotypes. In Harland’s (2000) study the young men poured scorn on the thought of other males displaying traits they perceived as feminine. This association was always linked directly to derogatory remarks such as “fruits” and “queers”. Such comments reveal the extent of repulsion and prejudice that many young men hold towards effeminate behaviour, and why they feel so much pressure to be perceived by others as matching up to the masculine ideal." The expression “teasing” may thus downplay what is more likely to be "harassment," “abuse” or “violence”.
In Canadian high schools, 'gay/lesbian/bisexual/queer' adolescents are generally not known by other students to be 'gay/lesbian/bisexual/queer'. That is, they are in the closet, and any abuse they experience occurs because they are "suspected" of being 'gay/lesbian/bisexual/queer'. The rarity of 'gay/lesbian/bisexual/queer' adolescents being "out" to others in North American high schools is illustrated in an online post by a gay male - 15-year-old Jamie Hubley - who died by suicide in October, 2011:
"I hate being the only open gay guy in my school… It f***ing sucks, I really want to end it. Like all of it, I not getting better theres 3 more years of high school left, Iv been on 4 different anti -depressants, none of them worked. I’v been depressed since january, How f***ing long is this going to last. People said “It gets better”. Its f***ing bull****. I go to see psychologist, What the f*** are they suppost to f***ing do? All I do is talk about problems, it doesnt make them dissapear?? I give up." (Jamie Hubley, Gay 15-Year-Old Ottawa, Canada Teen Commits Suicide, Cites Depression, School Troubles, October 17, 2011: http://www.huffingtonpost.com/2011/10/17/jamie-hubley-commits-suicide_n_1015646.html)
Jamie, however, had been bullied long before he decided to become openly gay in grade 10, and he was bullied based on the 'gender nonconforming based' suspicion that he was gay: "There are some reports in the media and on social media that James was bullied. This is true. We were aware of several occasions when he felt he was being bullied. In Grade 7 he was treated very cruelly simply because he liked figure skating over hockey." (Father says ‘bullying was definitely a factor’ in son Jamie Hubley’s suicide, by Darryl Morris, October 17, 2011: http://www.lgbtqnation.com/2011/10/father-says-bullying-was-definitely-a-factor-in-son-jamie-hubleys-suicide/) It should be noted that, if Jamie had ended his life via suicide before he outed himself to others, the best informants could have said was that he was "cruelly" bullied for gender nonconforming reasons, such as being called a "sissy" - more likely greatly abused for being deemed a "sissy," and likely also "gay" - because he liked figure skating. Most informants, in cases involving the suicides of such boys, might have said that the boys were "teased" for being a sissy, which would be a monumental understatement of the "terrorism" they likely experienced.




Table 1aa: Proportion difference for sexual minority males and females in the suicide and living controls groups.
Bayesian and Meta-Analysis Results: Males and Females

See: Related Information.




Table 2aa: Proportion difference for sexual minority males in the suicide and living controls groups.
Bayesian and Meta-Analysis Results: Males

See: Related Information.


Related Information: Proportion differences analyses for the Shaffer et al. (1995) and Renaud et al. (2010) studies carried out with the Tian et al. (2009) procedure. The 95% confidence intervals of the meta-analytically combined proportion differences, with five different methods, are based on the Tian et al. (2009) procedure. Sampling was set to BB = 100,000. All combined risk differences are statistically significant at p < 0.05. Analyses with "R" were carried out by Dr. Martin Plöderl, Dept. of Suicide Prevention, University Clinic of Psychiatry and Psychotherapy I, Christian Doppler Clinic, Salzburg, Austria. We thank Lu Tian for providing the R-syntax required to carry out the meta-analysis.

Bayesian Methodology. WinBUGS Version 1.4.3 (Lunn, Thomas, Best, & Spiegelhalter, 2000) was used in combination with R Version 2.12.1 (R Development Core Team, 2010) and the necessary packages (R2WinBUGS, polspline) for the Bayesian analysis. Analyses with "R" were carried out by Dr. Martin Plöderl, Dept. of Suicide Prevention, University Clinic of Psychiatry and Psychotherapy I, Christian Doppler Clinic, Salzburg, Austria and by Eric-Jan Wagenmakers (Department of Psychology, University of Amsterdam), the latter providing the R-syntax.

Tian L, Cai T, Pfeffer MA, Piankov N, Cremieux PY, Wei LJ (2009). Exact and efficient inference procedure for meta-analysis and its application to the analysis of independent 2 x 2 tables with all available data but without artificial continuity correction. Biostatistics, 10: 275-281. Abstract. Full Text. Full Text.
Abstract: Recently, meta-analysis has been widely utilized to combine information across comparative clinical studies for evaluating drug efficacy or safety profile. When dealing with rather rare events, a substantial proportion of studies may not have any events of interest. Conventional methods either exclude such studies or add an arbitrary positive value to each cell of the corresponding 2×2 tables in the analysis. In this article, we present a simple, effective procedure to make valid inferences about the parameter of interest with all available data without artificial continuity corrections. We then use the procedure to analyze the data from 48 comparative trials involving rosiglitazone with respect to its possible cardiovascular toxicity.


Arcsine Difference Meta-Analysis Method:



Table 1b - Arcsine Difference Meta-Analysis*:
Males & Females
The Homosexuality Factor in Adolescent Suicide**
Categories
Suicide Group
Control Group
Homosexual or
Related Harassment
Total
Homosexual or
Related Harassment
Total
Shaffer et al.
(1995)
Yes: 6
No: 114
120
Yes: 0
No: 147
147
Renaud et al.
(2010)
Yes: 7
No: 48
55
Yes: 2
No: 53
55
[Graph: Arcsine Difference Meta-Analysis - Males & Females]
Categories
AS
 Arcsin Difference

Cohen's h: Effect Size
Arcsin Related

Formula
AS = arcsin √P1 - arcsin √P2
h = 2(arcsin √P1) - 2(arcsin √P2)
Shaffer et al.
(1995)
P1 = 6/120 = 0.05 -- P2 = 0/147 = 0
0.225 (0.10 - 0.35)
One-/Two-Sided p = 0.002, 0.005***
P1 = 6/120 = 0.05 -- P2 = 0/147 = 0
h = 2 (0.225) - 2 (0)
h = 0.45 (0.20 - 0.70)
Renaud et al.
(2010)
P1 = 7/55: 0.1273 -- P2 = 2/55: 0.0364
AS = 0.3648 - 0.1919 = 0.1729
0.17 (-0.01 - 0.36)
One-/Two-Sided p = 0.047, 0.094***
P1 = 7/55: 0.1273 -- P2 = 2/55: 0.0364
h = 2(0.3648) - 2(0.1919) = 0.73 - 0.38
h = 0.35 (-0.02 - 0.73)
Two Studies
Combined

0.21 (0.11 - 0.31)
h = 0.42 (0.22 - 0.63)
Effect Size Magnitude for h: 0.20 (Small) - 0.50 (Medium) - 0.80 (Large)

Meta-Analysis Generated With
"Meta-Analysis with R. Version 1.6-1" Developed by Schwarzer (2012). Reference: Fixed / Random Effect(s) Models.
* Reference: Meta-Analysis Using Arcsine Difference.
** Homosexually oriented adolescents & those targeted for anti-gay harassment.
*** Calculated with SMP Program, Version 2.1 http://www.ugr.es/~bioest/software.htm
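The arcsine difference and Cohen's h used in Table 1b are straightforward to compute. This minimal sketch (the function names are ours) reproduces the per-study values shown above:

```python
from math import asin, sqrt

def arcsine_difference(p1, p2):
    """AS = arcsin(sqrt(p1)) - arcsin(sqrt(p2)), in radians."""
    return asin(sqrt(p1)) - asin(sqrt(p2))

def cohens_h(p1, p2):
    """Cohen's effect size h = 2*arcsin(sqrt(p1)) - 2*arcsin(sqrt(p2));
    by convention 0.20 is small, 0.50 medium, 0.80 large."""
    return 2 * arcsine_difference(p1, p2)

# Shaffer et al. (1995): 6/120 vs 0/147 -> AS ~ 0.23 (table: 0.225), h ~ 0.45
# Renaud et al. (2010): 7/55 vs 2/55 -> AS ~ 0.17, h ~ 0.35
print(round(arcsine_difference(6/120, 0), 2), round(cohens_h(6/120, 0), 2))
print(round(arcsine_difference(7/55, 2/55), 2), round(cohens_h(7/55, 2/55), 2))
```

A zero proportion poses no problem here (arcsin √0 = 0), which is one reason the arcsine approach is attractive for the Shaffer table with its empty control cell.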



Table 2b - Arcsine Difference Meta-Analysis*: Males
The Homosexuality Factor in Adolescent Suicide
**
Categories
Suicide Group
Control Group
Homosexual or
Related Harassment
Total
Homosexual or
Related Harassment
Total
Shaffer et al.
(1995)
Yes: 6
No: 89
95
Yes: 0
No: 116
116
Renaud et al.
(2010)
Yes: 5
No: 38
43
Yes: 0
No: 43
43
[Graph: Arcsine Difference Meta-Analysis - Males]
Categories
AS
 Arcsin Difference

Cohen's h: Effect Size
Arcsin Related

Formula
AS = arcsin √P1 - arcsin √P2
h = 2(arcsin √P1) - 2(arcsin √P2)
Shaffer et al.
(1995)
P1 = 6/95 = 0.0632 -- P2 = 0/116 = 0
0.25 (0.12 - 0.39)
One-/Two-Sided p = 0.002, 0.005***
P1 = 6/95 = 0.0632 -- P2 = 0/116 = 0
h = 0.50 (0.24 - 0.78)
Renaud et al.
(2010)
P1 = 5/43 = 0.1163 -- P2 = 0/43 = 0
0.35 (0.14 - 0.56)
One-/Two-Sided p = 0.010, 0.019***
P1 = 5/43 = 0.1163 -- P2 = 0/43 = 0
h = 0.70 (0.28 - 1.12)
Two Studies
Combined

0.28 (0.17 - 0.40)
h = 0.56 (0.34 - 0.80)
Effect Size Magnitude for h: 0.20 (Small) - 0.50 (Medium) - 0.80 (Large)

Meta-Analysis Generated With
"Meta-Analysis with R. Version 1.6-1" Developed by Schwarzer (2012). Reference: Fixed / Random Effect(s) Models.
* Reference: Meta-Analysis Using Arcsine Difference.
** Homosexually oriented adolescents & those targeted for anti-gay harassment.
*** Calculated with SMP Program, Version 2.1 http://www.ugr.es/~bioest/software.htm


Odds Ratio Meta-Analyses Using Continuity Corrections:


Table 1c - Odds Ratio Mantel-Haenszel Meta-Analysis: Males & Females - Using "0.5" Continuity Correction*
The Homosexuality Factor in Adolescent Suicide**
Categories
Suicide Group
Control Group
Homosexual or
Related Harassment
Total
Homosexual or
Related Harassment
Total
Shaffer et al.
(1995)
Yes: 6 + 0.5 = 6.5
No: 114 + 0.5 = 114.5
121
Yes: 0.0 + 0.5 = 0.5
No: 147 + 0.5 = 147.5
148
Renaud et al.
(2010)
Yes: 7
No: 48
55
Yes: 2
No: 53
55
[Graph: Odds Ratio Continuity Correction Meta-Analysis - Males & Females]

Meta-Analysis Generated With
"Meta-Analysis with R. Version 1.6-1" Developed by Schwarzer (2012). Reference: Fixed / Random Effect(s) Models.
* Reference: Continuity Correction: Adding 0.5 to each cell if a zero is in one cell. According to The Cochrane Collaboration (2011: See Excerpts), this is not a good method to use with rare events study results; however, only one cell of the 2x2s - the zero cell corrected to 0.5 - would qualify as a rare event.
** Homosexually oriented adolescents & those targeted for anti-gay harassment.
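The pooled fixed-effect result behind the graphed meta-analysis can be approximated by hand with the standard Mantel-Haenszel formula. In this sketch (our code, not the "Meta-Analysis with R" output), the Shaffer table carries the 0.5 correction from Table 1c while the Renaud table is used as-is; under these counts the formula yields a pooled OR of roughly 6.4:

```python
def mantel_haenszel_or(tables):
    """Fixed-effect Mantel-Haenszel pooled odds ratio for 2x2 tables
    given as (a, b, c, d) = (suicide-group cases, suicide-group
    non-cases, control cases, control non-cases)."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return num / den

tables = [
    (6.5, 114.5, 0.5, 147.5),  # Shaffer et al. (1995), 0.5 added (zero cell)
    (7.0, 48.0, 2.0, 53.0),    # Renaud et al. (2010), no correction needed
]
print(round(mantel_haenszel_or(tables), 2))  # ~ 6.39
```

The document reports this meta-analysis only graphically, so the value above is simply what the fixed-effect formula gives for these corrected counts.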



Table 1d - Odds Ratio Mantel-Haenszel Meta-Analysis: Males & Females - Using "TAC" Continuity Correction*
The Homosexuality Factor in Adolescent Suicide**
Categories
Suicide Group
Control Group
TAC
Corrections

Homosexual or
Related Harassment
Total
Homosexual or
Related Harassment
Total

Shaffer et
al. (1995)
Yes: 6 + 0.45 = 6.45
No: 114 + 0.45 = 114.45
120.9
Yes: 0.0 + 0.55 = 0.55
No: 147 + 0.55 = 147.55
148.1
Add 0.45
& 0.55,
Respectively
Renaud et
al. (2010)
Yes: 7
No: 48
55
Yes: 2
No: 53
55
None
[Graph: Odds Ratio TAC Correction Meta-Analysis - Males & Females]

Meta-Analysis Generated With
"Meta-Analysis with R. Version 1.6-1" Developed by Schwarzer (2012). Reference: Fixed / Random Effect(s) Models.
* Reference: TAC - Treatment Arm Correction - applied if a zero is in one cell. See: TAC Correction Value Formulas given by Diamond et al. (2007). According to The Cochrane Collaboration (2011: See Excerpts), this is not a good method to use with rare events study results; however, only one cell of one of the 2x2s - the zero cell corrected to 0.55 - would qualify as a rare event.
** Homosexually oriented adolescents & those targeted for anti-gay harassment.




Table 2c - Odds Ratio Mantel-Haenszel Meta-Analysis: Males - Using "0.5" Continuity Correction*
The Homosexuality Factor in Adolescent Suicide**
Categories
Suicide Group
Control Group
Homosexual or
Related Harassment
Total
Homosexual or
Related Harassment
Total
Shaffer et al.
(1995)
Yes: 6 + 0.5 = 6.5
No: 89 + 0.5 = 89.5
96
Yes: 0.0 + 0.5 = 0.5
No: 116 + 0.5 = 116.5
117
Renaud et al.
(2010)
Yes: 5 + 0.5 = 5.5
No: 38 + 0.5 = 38.5
44
Yes: 0.0 + 0.5 = 0.5
No: 43 + 0.5 = 43.5
44
[Graph: Odds Ratio Continuity Correction Meta-Analysis - Males]

Meta-Analysis Generated With
"Meta-Analysis with R. Version 1.6-1" Developed by Schwarzer (2012). Reference: Fixed / Random Effect(s) Models.
* Reference: Continuity Correction: Adding 0.5 to each cell if a zero is in one cell. According to The Cochrane Collaboration (2011: See Excerpts), this is not a good method to use with rare events study results; however, only the zero cells corrected to 0.5 would qualify as rare events.
** Homosexually oriented adolescents & those targeted for anti-gay harassment.




Table 2d - Odds Ratio Mantel-Haenszel Meta-Analysis: Males - Using "TAC" Continuity Correction*
The Homosexuality Factor in Adolescent Suicide**
Categories
Suicide Group
Control Group
TAC
Corrections

Homosexual or
Related Harassment
Total
Homosexual or
Related Harassment
Total

Shaffer et
al. (1995)
Yes: 6 + 0.43 = 6.43
No: 89 + 0.43 = 89.43
95.86
Yes: 0.0 + 0.57 = 0.57
No: 116 + 0.57 = 116.57
117.14
Add 0.43
& 0.57,
Respectively
Renaud et
al. (2010)
Yes: 5 + 0.5 = 5.5
No: 38 + 0.5 = 38.5
44
Yes: 0.0 + 0.5 = 0.5
No: 43 + 0.5 = 43.5
44
Add 0.5 to
Each Group
[Graph: Odds Ratio TAC Correction Meta-Analysis - Males]

Meta-Analysis Generated With
"Meta-Analysis with R. Version 1.6-1" Developed by Schwarzer (2012). Reference: Fixed / Random Effect(s) Models.
* Reference: TAC - Treatment Arm Correction - applied if a zero is in one cell. See: TAC Correction Value Formulas given by Diamond et al. (2007). According to The Cochrane Collaboration (2011: See Excerpts), this is not a good method to use with rare events study results; however, only the zero cells (corrected to 0.57 and 0.5) would qualify as rare events.
** Homosexually oriented adolescents & those targeted for anti-gay harassment.


The Peto Method Odds Ratio & Meta-Analysis:


Table 1e - Peto Method Odds Ratios
&
Meta-Analysis: Males & Females 1
The Homosexuality Factor in Adolescent Suicide*
Categories
Suicide Group
Control Group
Homosexual or
Related Harassment
Total
Homosexual or
Related Harassment
Total
Shaffer et al.
(1995)
Yes: 6
No: 114
120
Yes: 0
No: 147
147
Peto Odds Ratio
9.46 (1.87 - 47.91) 2
9.66 (1.90 - 48.98) 3

5Fisher: p = 0.008, 0.015 - Fisher Lower OR CI: 1.48
5Mid-P: p = 0.004, 0.008 - Mid-P Lower OR CI: 1.94
Note: Mid-P Lower CI "1.94" is close to Peto Lower CI "1.90"
6Unconditional Test one-sided p-values from 0.005 to 0.002
Renaud et al.
(2010)
Yes: 7
No: 48
55
Yes: 2
No: 53
55
Odds Ratios

Peto OR: 3.32 (0.85 - 12.89) 2
Peto OR: 3.32 (0.85 - 12.89) 3
Common OR: Fisher: p = 0.081, 0.161 - OR: 3.86 (0.68 - 28.43) 4
Taylor Series OR: 3.86 (0.76 - 19.51) 5
Mid-P OR: 3.82 (0.81 - 27.97) 5
Note: Mid-P Lower CI "0.81" is close to Peto Lower CI "0.85"
6Unconditional Test one-sided p-values from 0.054 to 0.044
Peto Odds Ratio
Meta-Analyses

OR: 5.11 (1.80 - 14.46) 2
OR: 5.15 (1.82 - 14.59) 3
[Graph: Peto Odds Ratio Meta-Analysis - Males & Females]

1. This may be the best method to use with rare events study results according to The Cochrane Collaboration (2011). However, the Peto OR method is recommended for incidences of less than 1%; this applies to the 0% cell, but the other incidences are above 1%: 6 / 120 = 5%, 7 / 55 = 12.7%, 2 / 55 = 3.6%. See comments below the table.
2. Peto OR Calculator: DJR Hutchon. Related Meta-Analysis Calculator: DJR Hutchon.
3. Peto ORs generated With "Meta-Analysis with R. Version 1.6-1" Developed by Schwarzer (2012). Graphed Peto Meta-Analysis Results: Generated With "Meta-Analysis with R. Version 1.6-1" - Reference: Fixed / Random Effect(s) Models.
4. Common Odds Ratio: Calculator.
5. Taylor Series Odds Ratio: OpenEpi. Mid-p OR: Conditional maximum likelihood estimate of Odds Ratio: CMLE Odds Ratio.
6. Unconditional p-values given in data table for two studies.
* Homosexually oriented adolescents & those targeted for anti-gay harassment.

Peto Odds Ratio Method: Bradburn et al. (2007) describe how the Odds Ratios are estimated using the Peto method: "The Peto one-step method [16: Yusuf et al, 1985] computes an approximation of the log-odds from the ratio of the efficient score to the Fisher information, both evaluated under the null hypothesis. These quantities are estimated, respectively, by the sum of the differences between the observed and expected numbers of events in the treatment arm and by the sum of the conditional hypergeometric variances." (p. 55) The method works well with low incidences (less than 1%) and accommodates the zero events (n = 0) present in the Shaffer study. However, it produces increasingly biased results (OR underestimates) when incidences are greater than 1%. For the Shaffer study, the underestimate would apply given the 5.0% incidence (6 / 120), with the same applying for the Renaud study: 12.7% (7 / 55). To note: For the Shaffer study, if the "0" were replaced by one, the OR would be 7.68 (95% CI: 0.90 - 171.83; Fisher exact, one-/two-sided p-values: 0.034, 0.048). The estimated Peto Odds Ratio of 9.66 would therefore be reasonable for a 2x2 that is more statistically significant when the "1" is replaced by the original "0," producing these Fisher exact, one-/two-sided p-values: 0.008, 0.016. The Renaud study Peto OR (3.32) illustrates the fact that, with incidences above 1%, Peto ORs are underestimates: 3.32 (Peto) vs. 3.86 (Taylor Series) & 3.82 (Mid-P).
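The Peto one-step estimate described above can be sketched directly from its definition (our code, not the Hutchon or R output): the log-odds is (O - E) / V, with E and V the hypergeometric expectation and variance of the event count in the first group, and the standard error is 1/sqrt(V).

```python
from math import exp, sqrt

def peto_odds_ratio(a, n1, c, n2):
    """Peto one-step odds ratio with 95% CI for a/n1 events in the
    first group vs c/n2 in the second: ln(OR) = (O - E) / V, where
    E and V are the hypergeometric mean and variance of a."""
    N = n1 + n2
    m = a + c                                   # total events
    E = n1 * m / N                              # expected events, group 1
    V = n1 * n2 * m * (N - m) / (N**2 * (N - 1))
    ln_or = (a - E) / V
    half = 1.96 / sqrt(V)                       # 95% CI half-width on ln scale
    return exp(ln_or), exp(ln_or - half), exp(ln_or + half)

# Shaffer et al. (1995), males & females: 6/120 vs 0/147
or_, lo, hi = peto_odds_ratio(6, 120, 0, 147)
print(round(or_, 2), round(lo, 2), round(hi, 2))  # ~ 9.66 (1.90 - 49.0)
```

This reproduces the tabled Peto OR of 9.66 and the lower CI bound of 1.90; the upper bound agrees with the tabled 48.98 to within the final rounding step.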





Table 2e - Peto Method Odds Ratios
&
Meta-Analysis: Males 1
The Homosexuality Factor in Adolescent Suicide**
Categories
Suicide Group
Control Group
Homosexual or
Related Harassment
Total
Homosexual or
Related Harassment
Total
Shaffer et al.
(1995)
Yes: 6
No: 89
95
Yes: 0
No: 116
116
Peto Odds Ratio 2
9.73 (1.91 - 49.55) 2
9.73 (1.91 - 49.55) 3
4Fisher: 0.008, 0.015 - Lower OR CI: 1.48
4Mid-P: 0.004, 0.008 - Lower OR CI: 1.95
Note: Mid-P Lower CI "1.95" is close to Peto Lower CI "1.91"

5Unconditional Test one-sided p-values from 0.005 to 0.002
Renaud et al.
(2010)
Yes: 5
No: 38
43
Yes: 0
No: 43
43
Peto Odds Ratio 2
8.16 (1.35 - 49.14) 2
8.16 (1.35 - 49.14) 3
4Fisher: 0.028, 0.055 - Lower OR CI: 0.96
4Mid-P: 0.014, 0.028 - Lower OR CI: 1.29
Note: Mid-P Lower CI "1.29" is close to Peto Lower CI "1.35"
5Unconditional Test one-sided p-values from 0.016 to 0.010
Peto Odds Ratio
Meta-Analysis: 2

OR: 8.99 (2.69 - 30.02) 2
OR: 8.99 (2.69 - 30.01) 3
[Graph: Peto Odds Ratio Meta-Analysis - Males]
 
1. This may be the best method to use with rare events study results according to The Cochrane Collaboration (2011). However, the Peto OR method is recommended for incidences of less than 1%; this applies to the 0% cells, but the other incidences are above 1%: 6 / 95 = 6.3% & 5 / 43 = 11.6%. See comments below the table.
2. Peto OR Calculator: DJR Hutchon. Related Meta-Analysis Calculator: DJR Hutchon.
3. Peto ORs generated With "Meta-Analysis with R. Version 1.6-1" Developed by Schwarzer (2012). Graphed Peto Meta-Analysis Results: Generated With "Meta-Analysis with R. Version 1.6-1" - Reference: Fixed / Random Effect(s) Models.
4. Calculations: OpenEpi. Mid-p OR: Conditional maximum likelihood estimate of Odds Ratio: CMLE Odds Ratio.
5. Unconditional p-values given in data table for two studies.
* Homosexually oriented adolescents & those targeted for anti-gay harassment.


Peto Odds Ratio Method: Bradburn et al. (2007) describe how the Odds Ratios are estimated using the Peto method:  "The Peto one-step method [16: Yusuf et al., 1985] computes an approximation of the log-odds from the ratio of the efficient score to the Fisher information, both evaluated under the null hypothesis. These quantities are estimated, respectively, by the sum of the differences between the observed and expected numbers of events in the treatment arm and by the sum of the conditional hypergeometric variances." (p. 55) The method works well with low incidences (less than 1%) and accommodates the zero-event cells (n = 0) present in both the Shaffer and Renaud studies. However, it produces increasingly biased results (OR underestimates) when incidences are greater than 1%. For the Shaffer study, the underestimate would apply given the 6.3% incidence (6 / 95), with the same applying for the Renaud study: 11.6% (5 / 43).
To note: For the Shaffer study, if the "0" were replaced by "1", the OR would be 0.90<7.75<174.04 (Fisher exact, one-/two-sided p-values: 0.033, 0.047). The estimated Peto Odds Ratio of "9.73" would therefore be reasonable for a 2x2 table that becomes more statistically significant when the "1" is replaced by the original "0" (Fisher exact, one-/two-sided p-values: 0.008, 0.015). For the Renaud study, if the "0" were replaced by "1", the OR would be 0.58<5.53<130.84 (Fisher exact, one-/two-sided p-values: 0.101, 0.202); replacing the "1" with the original "0" produces Fisher exact, one-/two-sided p-values of 0.028, 0.055. The Peto Odds Ratios, however, are more in line with the Mid-p values: Shaffer study Peto OR [9.73 (1.91 - 49.55)], Mid-p values [p = 0.004, 0.008]; Renaud study Peto OR [8.16 (1.35 - 49.14)], Mid-p values [p = 0.014, 0.028].
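The fixed-effect Peto pooling behind the meta-analysis row of Table 2e simply sums the per-study "O - E" differences and hypergeometric variances. The sketch below (helper name mine) combines the two male 2x2 tables from the studies above and reproduces the tabled pooled estimate of OR = 8.99 with CI lower limit 2.69:

```python
# Minimal sketch of the fixed-effect Peto meta-analysis:
# pooled log(OR) = sum(O - E) / sum(V) across studies.
import math

def o_e_v(a, b, c, d):
    """Return (O - E, V) for a 2x2 table [[a, b], [c, d]]."""
    n1, n0 = a + b, c + d
    m, n = a + c, a + b + c + d
    e = n1 * m / n                                     # expected events
    v = n1 * n0 * m * (n - m) / (n ** 2 * (n - 1))     # hypergeometric variance
    return a - e, v

studies = [(6, 89, 0, 116),   # Shaffer et al. (1995), males
           (5, 38, 0, 43)]    # Renaud et al. (2010), males
diffs, variances = zip(*(o_e_v(*s) for s in studies))
log_or = sum(diffs) / sum(variances)
se = 1 / math.sqrt(sum(variances))
or_, lo, hi = (math.exp(x) for x in
               (log_or, log_or - 1.96 * se, log_or + 1.96 * se))
print(round(or_, 2), round(lo, 2), round(hi, 1))   # 8.99 2.69 30.0
```

The upper limit lands at ~30.0, between the "30.02" and "30.01" reported by the two calculators in the table, the small difference reflecting rounding in intermediate steps.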




References, Excerpts & Abstracts: Meta-Analysis Concepts


Meta-Analysis Using Arcsine Difference

Abstract: For clinical trials with binary endpoints there are a variety of effect measures, for example risk difference, risk ratio and odds ratio (OR). The choice of metric is not always straightforward and should reflect the clinical question. Additional issues arise if the event of interest is rare. In systematic reviews, trials with zero events in both arms are encountered and often excluded from the meta-analysis. The arcsine difference (AS) is a measure which is rarely considered in the medical literature. It appears to have considerable promise, because it handles zeros naturally, and its asymptotic variance does not depend on the event probability. This paper investigates the pros and cons of using the AS as a measure of intervention effect. We give a pictorial representation of its meaning and explore its properties in relation to other measures. Based on analytical calculation of the variance of the arcsine transformation, a more conservative variance estimate for the rare event setting is proposed. Motivated by a published meta-analysis in cardiac surgery, we examine the statistical properties of the various metrics in the rare event setting. We find the variance estimate of the AS to be more stable than that of the log-OR, even if events are rare. However, parameter estimation is biased if the groups are markedly unbalanced. Though, from a theoretical viewpoint, the AS is a natural choice, its practical use is likely to continue to be limited by its less direct interpretation.

"The arcsine transformation was introduced in the statistical literature for its approximative variance-stabilizing property. The key advantage is that a stabilized variance also leads to more robust estimation. If the risks in the treatment arms are estimated with noise, the variance estimate of the AS is less dramatically changed than that of the log-OR, even if events are rare. This is an advantage of the AS as a measure of treatment effect particularly when zero cell studies occur. A disadvantage is that if events are rare in both groups and the groups sizes are markedly unbalanced, bias will be induced by the transformation. In this situation, though, other methods are likewise prone to bias." (p. 735)
Abstract: In meta-analyses, it sometimes happens that smaller trials show different, often larger, treatment effects. One possible reason for such 'small study effects' is publication bias. This is said to occur when the chance of a smaller study being published is increased if it shows a stronger effect. Assuming no other small study effects, under the null hypothesis of no publication bias, there should be no association between effect size and effect precision (e.g. inverse standard error) among the trials in a meta-analysis. A number of tests for small study effects/publication bias have been developed. These use either a non-parametric test or a regression test for association between effect size and precision. However, when the outcome is binary, the effect is summarized by the log-risk ratio or log-odds ratio (log OR). Unfortunately, these measures are not independent of their estimated standard error. Consequently, established tests reject the null hypothesis too frequently. We propose new tests based on the arcsine transformation, which stabilizes the variance of binomial random variables. We report results of a simulation study under the Copas model (on the log OR scale) for publication bias, which evaluates tests so far proposed in the literature. This shows that: (i) the size of one of the new tests is comparable to those of the best existing tests, including those recently published; and (ii) among such tests it has slightly greater power, especially when the effect size is small and heterogeneity is present. Arcsine tests have additional advantages that they can include trials with zero events in both arms and that they can be very easily performed using the existing software for regression tests.
"The arcsine difference has a long history, dating back to the 1940s [55, 59, 117, 118, 119], and is often used in other contexts [58, 120, 121], but not, to our knowledge, as a measure of treatment effect in clinical trials. It is nevertheless briefly mentioned in this context in a series of references [16, 23, 56, 122, 123, 124]. Its attraction is that the arcsine transformation is the asymptotically variance-stabilising transformation for the binomial distribution." (p. 78) "Transforming the binomial risk introduces bias, which will be greater for small sample sizes and rare events." (p. 86)
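The variance-stabilizing property described in these excerpts can be made concrete: the arcsine difference is asin(sqrt(p1)) - asin(sqrt(p2)), and its approximate variance 1/(4n1) + 1/(4n2) depends only on the group sizes, never on the event rates. A minimal sketch (function name mine), applied to the Shaffer male proportions used on this page, shows that a zero cell needs no continuity correction:

```python
# Sketch of the arcsine difference (AS) effect measure: asin(sqrt(p))
# stabilizes the binomial variance, so the approximate variance of the
# difference depends only on the group sizes -- and p = 0 is handled
# naturally, with no continuity correction.
import math

def arcsine_difference(events1, n1, events2, n2):
    as_diff = (math.asin(math.sqrt(events1 / n1))
               - math.asin(math.sqrt(events2 / n2)))
    var = 1 / (4 * n1) + 1 / (4 * n2)   # independent of the event rates
    return as_diff, var

# Shaffer et al. (1995) males: 6/95 vs. a zero-event control arm (0/116)
d, v = arcsine_difference(6, 95, 0, 116)
print(round(d, 3), round(v, 5))   # 0.254 0.00479
```

As the second excerpt cautions, this simplicity comes at a price: the transformation is biased for small samples and markedly unbalanced groups.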



Meta-Analysis Using Peto Odds Ratio Method

From Section "16.9.5  Validity of methods of meta-analysis for rare events" (Full Text):

"Bradburn et al. found that many of the most commonly used meta-analytical methods were biased when events were rare (Bradburn 2007).  The bias was greatest in inverse variance and DerSimonian and Laird odds ratio and risk difference methods, and the Mantel-Haenszel odds ratio method using a 0.5 zero-cell correction.  As already noted, risk difference meta-analytical methods tended to show conservative confidence interval coverage and low statistical power when risks of events were low.

At event rates below 1% the Peto one-step odds ratio method was found to be the least biased and most powerful method, and provided the best confidence interval coverage, provided there was no substantial imbalance between treatment and control group sizes within studies, and treatment effects were not exceptionally large. This finding was consistently observed across three different meta-analytical scenarios, and was also observed by Sweeting et al. (Sweeting 2004)...

Methods that should be avoided with rare events are the inverse-variance methods (including the DerSimonian and Laird random-effects method). These directly incorporate the study’s variance in the estimation of its contribution to the meta-analysis, but these are usually based on a large-sample variance approximation, which was not intended for use with rare events. The DerSimonian and Laird method is the only random-effects method commonly available in meta-analytic software. We would suggest that incorporation of heterogeneity into an estimate of a treatment effect should be a secondary consideration when attempting to produce estimates of effects from sparse data – the primary concern is to discern whether there is any signal of an effect in the data."

From Section "9.4.4.2  Peto odds ratio method" (Full Text):

Peto’s method (Yusuf 1985) can only be used to pool odds ratios. It uses an inverse variance approach but utilizes an approximate method of estimating the log odds ratio, and uses different weights. An alternative way of viewing the Peto method is as a sum of ‘O – E’ statistics. Here, O is the observed number of events and E is an expected number of events in the experimental intervention group of each study. The approximation used in the computation of the log odds ratio works well when intervention effects are small (odds ratios are close to one), events are not particularly common and the studies have similar numbers in experimental and control groups. In other situations it has been shown to give biased answers. As these criteria are not always fulfilled, Peto’s method is not recommended as a default approach for meta-analysis.  Corrections for zero cell counts are not necessary when using Peto’s method. Perhaps for this reason, this method performs well when events are very rare (Bradburn 2007) (see Chapter 16, Section 16.9).

Cochrane Collaboration’s Open Learning Material for Cochrane reviewers - From Section "Combining studies: Weighted Averages" (Full Text):

The Peto method: The Peto method works for odds ratios only. Focus is placed on the observed number of events in the experimental intervention. We call this O for 'observed' number of events, and compare this with E, the 'expected' number of events. Hence an alternative name for this method is the 'O - E' method. The expected number is calculated using the overall event rate in both the experimental and control groups. Because of the way the Peto method calculates odds ratios, it is appropriate when trials have roughly equal number of participants in each group and treatment effects are small. Indeed, it was developed for use in mega-trials in cancer and heart disease where small effects are likely, yet very important. The Peto method is better than the other approaches at estimating odds ratios when there are lots of trials with no events in one or both arms. It is the best method to use with rare outcomes of this type. The Peto method is generally less useful in Cochrane reviews, where trials are often small and some treatment effects may be large.




Yusuf S, Peto R, Lewis J, Collins R, Sleight P (1985). Beta blockade during and after myocardial infarction: an overview of the randomized trials. Progress in Cardiovascular Diseases, 27(5): 335-71. Abstract.



Continuity Corrections in Odds Ratio Meta-Analyses: "Adding 0.5 to Each Cell Continuity Correction" & "TAC: Treatment Arm Correction," When a Zero is in One Cell.

  • Types of Continuity Correction
     • Constant k – typically 0.5
     • 'Treatment arm' continuity correction (based on the reciprocal of the opposite group size; causes less bias when encountering severely imbalanced groups)
     • Empirical continuity correction (based on an empirical estimate of the pooled effect size using the non-zero event studies)
     (Sweeting et al., 2004)
  • Commonly Used Statistical Methods
     • Fixed effect models: inverse variance-weighted (I-V) method - Mantel-Haenszel (M-H) method - Peto method - Logistic regression - Bayesian method
     • Random effects models: DerSimonian & Laird (D-L) method - Bayesian method
  • Comparisons of Commonly Used Existing Methods
     • The choice of method in a sparse-event meta-analysis is important since certain methods perform poorly, especially when group imbalances exist (bias is greatest using the I-V and D-L methods, and the M-H method with a CC of 0.5)
     • The M-H method using the alternative CC provides the least biased results for all group imbalances
     • At event rates below 1%, the Peto method provides the least biased, most powerful and best CI coverage for balanced groups, but bias increases with greater group imbalance and larger treatment effect
     • Logistic regression performs well and is generally unbiased and reliable
     • The Bayesian fixed-effect model performs consistently well irrespective of group imbalance
     • Alternative CCs perform better than a constant CC
     (Sweeting et al., 2004; Bradburn et al., 2007)
  • A New Advance: Exact Inference Procedure
     • A new method using an exact inference procedure proposed by Tian et al. (2009): combines across-trial information without excluding double-zero studies and without continuity correction, giving exact inference that does not rely on large-sample approximations
     • The RD with an associated exact CI can be constructed using the procedure (the CI can be over-conservative in some cases)

Diamond GA, Bax L, Kaul S (2007).  Uncertain Effects of Rosiglitazone on the Risk for Myocardial Infarction and Cardiovascular Death. Annals of Internal Medicine, 147, 578–581. Full Text.
We estimated the pooled odds ratio as our measure of effect size by using fixed-effects (for example, Mantel–Haenszel) and random-effects (DerSimonian–Laird) models (8). When applicable, we used methods with or without 2 continuity corrections. One is a constant correction (CC) that adds values of 0.5 to all cells of the 2 × 2 contingency table of the study selected for correction. The other is a treatment arm correction (TAC) that adds values proportional to the reciprocal of the size of the opposite treatment group. (See the Appendix for details.) ... Appendix: This Appendix describes the continuity corrections applied to myocardial infarction data for trial number 49653/04 in Nissen and Wolski's meta-analysis (1). The CC for continuity adds 0.5 to each cell of the 2 × 2 contingency table, effectively increasing the treatment and control group sizes by 1 and the total study sample size by 2 (from 348 uncorrected to 350 corrected). The TAC for continuity adds a value proportional to the reciprocal of the size of the opposite treatment group, normalized to a sum of 1 for event and no-event cells, resulting in an increase in the total study sample size by 2 (identical to that in the CC for continuity). With R being the ratio of group sizes and S being the sum of corrections for event and no-event cells, the TAC for continuity adds a factor of R/[S(R + 1)] to the larger group and 1/[S(R + 1)] to the other group. In the example shown (Appendix Table), S is set to 1 and R is 232/116 = 2. The correction in the (larger) treatment group becomes 2/[1 × (2 + 1)] = 2/3 = 0.67, and that in the (smaller) control group becomes 1/[1 × (2 + 1)] = 1/3 = 0.33.
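The treatment arm correction described in this appendix is easy to check numerically. The sketch below (function name mine) reproduces the worked example: with group sizes 232 and 116, R = 2 and the per-arm corrections are 2/3 and 0.33:

```python
# Sketch of the treatment arm correction (TAC) for a 2x2 table with a
# zero cell, per the Diamond et al. (2007) appendix: with R the ratio of
# group sizes and S the sum of corrections per arm, add R/[S(R+1)] to
# the larger group and 1/[S(R+1)] to the other.
def tac_corrections(n_treatment, n_control, s=1.0):
    """Return the (treatment, control) cell corrections; they sum to
    1 per arm when S = 1, unlike the constant 0.5-per-cell correction."""
    r = n_treatment / n_control
    return r / (s * (r + 1)), 1 / (s * (r + 1))

# Worked example from the appendix: groups of 232 and 116 give R = 2.
t, c = tac_corrections(232, 116)
print(round(t, 2), round(c, 2))   # 0.67 0.33
```

Because the correction follows the group-size imbalance, it introduces less bias than the constant 0.5 correction when groups are severely unbalanced, which is the point made in the bullet summary above.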



Fixed Effect & Random Effects Meta-Analyses

"Fixed effect: The fixed effect model assumes that all studies in the meta-analysis share a common true effect size. Put another way, all factors which could influence the effect size are the same in all the study populations, and therefore the effect size is the same in all the study populations. It follows that the observed effect size varies from one study to the next only because of the random error inherent in each study.
Random effects: By contrast, the random effects model assumes that the studies were drawn from populations that differ from each other in ways that could impact on the treatment effect. For example, the intensity of the intervention or the age of the subjects may have varied from one study to the next. It follows that the effect size will vary from one study to the next for two reasons. The first is random error within studies, as in the fixed effect model. The second is true variation in effect size from one study to the next."
Abstract: There are two popular statistical models for meta-analysis, the fixed-effect model and the random-effects model. The fact that these two models employ similar sets of formulas to compute statistics, and sometimes yield similar estimates for the various parameters, may lead people to believe that the models are interchangeable. In fact, though, the models represent fundamentally different assumptions about the data. The selection of the appropriate model is important to ensure that the various statistics are estimated correctly. Additionally, and more fundamentally, the model serves to place the analysis in context. It provides a framework for the goals of the analysis as well as for the interpretation of the statistics. In this paper we explain the key assumptions of each model, and then outline the differences between the models. We conclude with a discussion of factors to consider when choosing between the two models.
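The distinction drawn in this abstract can be sketched numerically: the fixed effect estimate weights each study by its inverse variance, while the DerSimonian-Laird random-effects estimate first estimates a between-study variance tau-squared from Cochran's Q and adds it to every study's variance. The code below is an illustrative sketch with made-up log-OR inputs, not data from this page; the function name is mine.

```python
# Sketch contrasting fixed effect and DerSimonian-Laird random-effects
# pooling of study effect sizes (e.g., log odds ratios).
def pool(effects, variances):
    w = [1 / v for v in variances]
    fixed = sum(wi * e for wi, e in zip(w, effects)) / sum(w)
    q = sum(wi * (e - fixed) ** 2 for wi, e in zip(w, effects))  # Cochran's Q
    df = len(effects) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)            # between-study variance
    w_star = [1 / (v + tau2) for v in variances]
    random = sum(wi * e for wi, e in zip(w_star, effects)) / sum(w_star)
    return fixed, random, tau2

# With identical observed effects the models coincide (tau^2 = 0):
f, r, t2 = pool([0.8, 0.8], [0.1, 0.2])
print(f == r, t2)   # True 0.0
```

When the observed effects disagree by more than their within-study error explains, tau-squared becomes positive and the random-effects weights shift toward equality across studies, which is why the two models can give different pooled estimates and intervals.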


General References: Meta-Analysis

Conclusion: In summary, many of the commonly used methods for meta-analysis give inappropriate answers when data are sparse. The choice of the most appropriate method depends on the anticipated background event rate and structure of the trials. No method gives completely unbiased estimates in any circumstance when events are rare. At event rates below 1 per cent the Peto one-step odds ratio method appears to be the least biased and most powerful method, and provides the best confidence interval coverage, provided there is no substantial imbalance in treatment and control group sizes within trials, and treatment effects are not exceptionally large. In other circumstances the MH OR without zero-cell corrections, logistic regression and the exact method perform similarly to each other, and are less biased than the Peto method. (p. 75)





Appendix A: Percentage of Canadian Adolescents in Sexual Minority Categories.




Sexual Minority Demographics: 17 High Schools in
Three Canadian Cities: Toronto, Kingston, and Montreal *

Categories                    Heterosexual      Gay / Lesbian    Bisexual         Questioning
N's (% of Total = 3,636)      3,506 (96.4%)     12 (0.33%)       50 (1.37%)       68 (1.87%)
Sex                           M        F        M       F        M       F        M       F
n's (M = 1,708; F = 1,928)    1,648** 1,858**   9       3        15      35       36      32
Percentage in category        47%     53%       75%     25%      30%     70%      52.9%   47.1%
Percentage in sex category    96.48%  96.37%    0.52%   0.15%    0.88%   1.81%    2.11%   1.66%

Gay / Bisexual Males: 1.40% - Lesbian / Bisexual Females: 1.96%

Data Source: Williams et al. (2003)
* Anonymous pencil & paper survey.
** Numbers not given; calculated from the given percentages.

Gay- and bisexual-identified males account for 1.40% of males.



Montreal High Schools Survey: 2004 *

Categories                    Heterosexual     Heterosexual ID,    Gay / Lesbian **   Bisexual        ID Unsure
                                               Some Homosexuality
N's (Total = 1,856)           1,624 (87.5%)    115 (6.20%)         7 (0.37%)          51 (2.75%)      59 (3.18%)
Sex                           M       F        M       F           M       F          M       F       M       F
n's (M = 941; F = 915)        867     757      30      85          2       5          18      33      24      35
Percentage in category        53.4%   46.6%    26.1%   73.9%       28.6%   71.4%      35.3%   64.7%   40.7%   59.3%
Percentage in sex category    91.1%   82.7%    3.2%    9.3%        0.21%   0.55%      1.9%    3.6%    2.5%    3.8%

Gay / Bisexual Males: 2.11% - Lesbian / Bisexual Females: 4.15%

Data Source: Zhao et al. (2010)
* Anonymous pencil & paper survey of 1,856 students 14 years of age and older from 14 public and private high schools in Montréal, Québec.
** Male and female counts in this category not given in the paper; obtained from the author.

Gay- and bisexual-identified males account for 2.11% of males.




North American National Adult Surveys

Study                       Sex   n's *     Heterosexual       Gay or Lesbian   Bisexual       Unsure

National Epidemiologic Survey on Alcohol and Related Conditions, Wave 2: 2004/05, USA
Bolton & Sareen (2011)      M     14,481    14,109 (97.4%)     190 (1.31%)      81 (0.56%)     101 (0.70%)
                            F     19,896    19,489 (97.95%)    145 (0.73%)      161 (0.81%)    101 (0.51%)
Gay / Bisexual Males: 1.87% - Lesbian / Bisexual Females: 1.54%

Canadian Community Health Survey, Cycle 2.1, 2003
Brennan et al. (2010)       M     49,901    49,065 (98.3%)     536 (1.07%)      300 (0.60%)    ---
Steele et al. (2009)        F     61,715    60,937 (98.74%)    354 (0.57%)      424 (0.69%)    ---
Gay / Bisexual Males: 1.67% - Lesbian / Bisexual Females: 1.26%

Averages across the two surveys:
Gay / Bisexual Males: (1.67% + 1.87%) / 2 = 1.77%
Lesbian / Bisexual Females: (1.26% + 1.54%) / 2 = 1.40%
* M: Males - F: Females
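The two-survey averages can be checked with a few lines of arithmetic; the percentages below are the ones tabulated above:

```python
# Arithmetic check of the averages across the two national adult surveys.
nesarc = {"males": 1.87, "females": 1.54}   # Bolton & Sareen (2011)
cchs   = {"males": 1.67, "females": 1.26}   # Brennan (2010) / Steele (2009)
avg = {k: (nesarc[k] + cchs[k]) / 2 for k in nesarc}
print(round(avg["males"], 2), round(avg["females"], 2))   # 1.77 1.4
```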




Brennan DJ, Ross LE, Dobinson C, Veldhuizen S, Steele LS (2010). Men's sexual orientation and health in Canada. Canadian Journal of Public Health, 101(3): 255-8. Abstract. PDF Download. Full Text.

Bolton S-L, Sareen J (2011). Sexual Orientation and Its Relation to Mental Disorders and Suicide Attempts: Findings From a Nationally Representative Sample. Canadian Journal of Psychiatry, 56(1): 35-43. Full Text. Full Text.

Steele LS, Ross LE, Dobinson C, Veldhuizen S, Tinmouth JM (2009). Women's sexual orientation and health: results from a Canadian population-based survey. Women Health, 49(5): 353-67. PubMed Abstract.

Williams T, Connolly J, Pepler D, Craig W (2003). Questioning and sexual minority adolescents: high school experiences of bullying, sexual harassment and physical abuse. Canadian Journal of Community Mental Health (Revue canadienne de santé mentale communautaire), 22(2): 47-58. PubMed Abstract. Summary.

Zhao Y, Montoro R, Igartua K, Thombs BD (2010). Suicidal Ideation and Attempt Among Adolescents Reporting "Unsure" Sexual Identity or Heterosexual Identity Plus Same-Sex Attraction or Behavior: Forgotten Groups? Journal of the American Academy of Child and Adolescent Psychiatry, 49(2): 104-113. Abstract. PDF Download.





References: Gender Nonconformity & Harassment

Bailey JM, Zucker KJ (1995). Childhood sex-typed behavior and sexual orientation: A conceptual analysis and quantitative review. Developmental Psychology, 31(1): 43-55. Abstract.

Bell AP, and Weinberg MS (1978). Homosexualities: a study of diversity among men & women. New York: Simon and Schuster.

Bering J (2010). Is your child a "prehomosexual"? Forecasting adult sexual orientation. Scientific American. Full Text.


Espelage DL, Aragon SR, Birkett M (2008). Homophobic teasing, psychological outcomes, and sexual orientation among high school students: What influence do parents and schools have?  School Psychology Review [Special issue], 37: 202-216. Abstract. Full Text. Full Text. Download Page.

Harland, Ken (2009). Acting Tough: Young Men, Masculinity and the Development of Practice in Northern Ireland. Keynote Address, Men’s Health Conference, 2008. Nowhere Man Press. Full Text. - Related 2008 PowerPoint Presentation: Acting Tough: Young men, Masculinity and Lessons from Practice in Northern Ireland. Full Text. Author Bibliography.

Lippa RA (2008). The Relation Between Childhood Gender Nonconformity and Adult Masculinity–Femininity and Anxiety in Heterosexual and Homosexual Men and Women. Sex Roles, 59: 684–693. Abstract.

Lunn DJ, Thomas A, Best N, Spiegelhalter D (2000). WinBUGS - a Bayesian modelling framework: concepts, structure, and extensibility. Statistics and Computing, 10, 325-337.

Menon M (2011). Does Felt Gender Compatibility Mediate Influences of Self-Perceived Gender Nonconformity on Early Adolescents’ Psychosocial Adjustment? Child Development. Article first published online. Abstract.

Plöderl M, Fartacek R (2009). Childhood Gender Nonconformity and Harassment as Predictors of Suicidality among Gay, Lesbian, Bisexual, and Heterosexual Austrians. Archives of Sexual Behavior, 38: 400–410. PubMed Abstract.

Rees-Turyn AM, Doyle C, Holland A, Root S (2008). Sexism and sexual prejudice (homophobia): The impact of the gender belief system and inversion theory on sexual orientation research and attitudes toward sexual minorities. Journal of LGBT Issues in Counseling, 2(1): 2-25. Related Information in: Rees AM, Doyle C, Miesch J (2006). Sexual Orientation, Gender Role Expression, and Stereotyping:  The Intersection Between Sexism and Sexual Prejudice (Homophobia). VISTAS Online: Full Text.

Remafedi G, Farrow JA, Deisher RW (1991). Risk factors for attempted suicide in gay and bisexual youth. Pediatrics, 87(6): 869-75. PubMed Abstract. Tabulated Results.

Renaud J, Berlim MT, Begolli M, McGirr A, Turecki G (2010). Sexual orientation and gender identity in youth suicide victims: an exploratory study. Canadian Journal of Psychiatry, 55(1): 29-34. PubMed Abstract. Full Text.

Shaffer D, Fisher P, Parides M, Gould M (1995). Sexual orientation in adolescents who commit suicide. Suicide and Life-Threatening Behavior, Supplement 25: 64-71. PubMed Abstract.

Tremblay P, Ramsay R (2002). Statistical results generated via SPSS-10 by Thomas G. Albright, Systems Analyst/Programmer at the Kinsey Institute for Research in Sex, Gender and Research. January 9 and 11, 2002. Tabulated Results.

Yunger JL, Carver PR, Perry DG (2004). Does gender identity influence children's psychological well-being? Developmental Psychology, 40(4): 572-82. PubMed Abstract.

Wade L (2011). If We’re Born Gay, How Would We Know? Full Text.

Waldman AA (2012). Tormented: Anti-Gay Bullying in Schools. Temple University Law Review, 84: 385-42. Full Text.



Additional References

Månsdotter A, Lundin A, Falkstedt D, Hemmingsson T (2009). The association between masculinity rank and mortality patterns: a prospective study based on the Swedish 1969 conscript cohort. Journal of Epidemiology and Community Health, 63(5): 408-13. Abstract.

Sandfort TG, Melendez RM, Diaz RM (2007). Gender nonconformity, homophobia, and mental distress in Latino gay and bisexual men. Journal of Sex Research, 44(2): 181–189. Full Text. PubMed Abstract.

Skidmore WC, Linsenmeier JAW, Bailey JM (2006). Gender nonconformity and psychological distress in lesbians and gay men. Archives of Sexual Behavior, 35(6): 685-697.
PubMed Abstract.

Toomey RB, Ryan C, Diaz RM, Card NA, Russell ST (2010). Gender-nonconforming lesbian, gay, bisexual, and transgender youth: school victimization and young adult psychosocial adjustment. Developmental Psychology, 46(6): 1580-9. PDF Download. PubMed Abstract.

Gordon AR, Meyer IH (2007). Gender nonconformity as a target of prejudice, discrimination, and violence against LGB individuals. Journal of LGBT Health Research, 3(3): 55-71. Full Text. PubMed Abstract.

Rieger G, Linsenmeier JA, Gygax L, Bailey JM (2008). Sexual orientation and childhood gender nonconformity: evidence from home videos. Developmental Psychology, 44(1): 46-58. Abstract.



Habib MK, Magruder-Habib KM, Kupper LL (1987/1988). A Review of Sample Size Determination in Comparative Studies. Institute of Statistics Mimeo Series No. 1840 (1987) & The Center for Computational Statistics Technical Report Series - TR 24 (1988). Reference. Full Text. Describes Fisher Unconditional Test.

Banerjee A, Chitnis UB, Jadhav SL, Bhawalkar JS, Chaudhury S (2009). Hypothesis testing, type I and type II errors. Industrial Psychiatry Journal, 18(2): 127-31. Full Text. PubMed Abstract.





Confidence Intervals

Dann RS, Koch GG (2005). Review and evaluation of methods for computing confidence intervals for the ratio of two proportions and considerations for non-inferiority clinical trials. Journal of Biopharmaceutical Statistics, 15(1): 85-107. Abstract.
"Problems due to small event rates were avoided because of use of methods for exact odds ratios for counts less than or equal to 3. Because of this modification, the performance of the methods may be slightly different from that presented previously in the literature." (p. 102)

Martin D, Austin H (1991). An efficient program for computing conditional maximum likelihood estimates and exact confidence limits for a common odds ratio. Epidemiology. 2(5): 359-362. Abstract. This is reference given for OpenEpi calculator.

Mehta CR, Patel NR (1997). Exact Inference for Categorical Data. Harvard University and Cytel Software Corporation. PDF Download.
One way to make valid statistical inferences in the presence of small, sparse or unbalanced data is to compute exact p-values and confidence intervals, based on the permutational distribution of the test statistic. (p. 1)
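The permutational (conditional) distribution Mehta and Patel describe is, for a single 2x2 table with all margins fixed, the hypergeometric distribution. The sketch below (function name mine) computes the exact one-sided p-value and its mid-p variant for the Shaffer et al. (1995) male table used on this page, reproducing the "Fisher: 0.008" and "Mid-P: 0.004" values reported in Table 2e:

```python
# Sketch of an exact conditional (permutational) one-sided p-value for a
# 2x2 table: sum the hypergeometric point probabilities of tables at
# least as extreme as the one observed, with all margins held fixed.
from math import comb

def exact_one_sided_p(a, b, c, d):
    n1, n0, m = a + b, c + d, a + c
    total = comb(n1 + n0, m)
    # P(X >= a) under the hypergeometric null distribution
    p = sum(comb(n1, x) * comb(n0, m - x)
            for x in range(a, min(n1, m) + 1)) / total
    point = comb(n1, a) * comb(n0, m - a) / total
    return p, p - point / 2     # (Fisher one-sided, mid-p one-sided)

# Shaffer et al. (1995) males: 6/95 suicides vs. 0/116 controls
fisher, midp = exact_one_sided_p(6, 89, 0, 116)
print(round(fisher, 3), round(midp, 3))   # 0.008 0.004
```

The mid-p value subtracts half the probability of the observed table, which is why it is less conservative than the ordinary Fisher exact p-value discussed elsewhere on this page.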

Rogers JR, Lester D (2010). Understanding Suicide: Why We Don't and How We Might. Cambridge, MA: Hogrefe Publishing. Hogrefe Publishing. Amazon. Book Review.
Recommendation 2.17: Researchers must continue to move beyond a sole reliance on statistical significance in interpreting quantitative research in suicidology to address issues of the clinical and practical usefulness of their results. (p. 22)

In Chapter 2: General Methodological Issues.

"Suicidologists have had a great difficulty in identifying meaningful correlates and predictors of suicidal behavior. Because of this Neuringer and Kolstoe (1966) suggested adopting less stringent criteria for statistical significance in suicide research, perhaps allowing rejection of the null hypothesis at the 10% level instead of the 5% level. This is an intriguing idea which has never been followed up, but it would result in the appearance of a larger proportion of "significant" results that were never replicated." (p. 21)

Shoval G, Mansbach-Kleinfeld I, Farbstein I, Kanaaneh R, Lubin G, Apter A, Weizman A, Zalsman G (2012). Self versus maternal reports of emotional and behavioral difficulties in suicidal and non-suicidal adolescents: An Israeli nationwide survey. European Psychiatry, [Epub ahead of print]. Abstract.
"This study demonstrated that mothers in the community are mostly unaware of the suicide ideation and attempts of their adolescents and hardly recognize their emotional and behavioral difficulties."

Neuringer C, Kolstoe RH (1966). Suicide research and the nonrejection of the null hypothesis. Perceptual & Motor Skills, 22: 115-118. Summary & First Page Excerpt.


To Top of Page
To the Home Page