Linkage Disequilibrium Approximate Bayes Factor (LD-ABF)

Toy Example

Balancing selection maintains multiple haplotypes in a population due to one or more evolutionary pressures. It not only results in increased polymorphism at the allele directly under selection; surrounding variants on the same haplotypes also rise in frequency in a process known as hitchhiking. Our test for balancing selection, the linkage disequilibrium approximate Bayes factor (LD-ABF), looks for both strong linkage disequilibrium and a high density of polymorphisms. To test for association between a given variant and its neighboring variants, let's first consider testing the association between one test variant and a single nearby neighboring variant.

[Figure 1: LDABF_Fig_1.jpg]

Take test variant $x_i$, where $x_i\in\lbrace0,1\rbrace$, with 0 corresponding to the major allele and 1 to the minor allele for individual $i$, and neighboring variant $y_i\in\lbrace0,1\rbrace$, again corresponding to the major or minor allele. A logistic regression model is a natural choice for these binary outcomes.

$$ p(y_i=1|x_i)=logit^{-1}(\beta_{0j}+\beta_{1j}x_i)$$

Where $\beta_{1j}$ corresponds to the log odds ratio of observing the alternate allele of neighboring variant $j$ given that we observe the alternate allele of the test variant. A standard frequentist approach may run into issues here: it is common to see SNPs in perfect or near-perfect LD, which creates complete or quasi-complete separation, and rare variants lead to sparsity in the comparison, which can also result in non-identifiability of the model.

Following Wakefield’s approximate Bayes factor method (Wakefield 2007; 2009), which requires only summary statistics and has been used in a variety of case-control genome-wide association study (GWAS) settings (Tachmazidou et al. 2019; Wuttke et al. 2019), we again test for association between a test SNP and a neighboring variant. The Bayes factor (Kass and Raftery 1995) is defined as:

$$BF=\frac{P(y|M_1)}{P(y|M_0)}$$

Where, instead of the outcome being disease status, in our setting the outcome is whether the neighboring allele is the minor or major allele on the haplotype. Again, we compare the intercept-only model $M_0=logit^{-1}(\beta_{0j})$ versus the model in which the neighboring variant has some association, $M_1=logit^{-1}(\beta_{0j}+\beta_{1j}x_i)$.

Let’s say we’re looking at two alleles, A and B, where A is our test variant and B is our neighboring comparison variant. You could think about it in terms of vectors of binary indicators for whether we see the minor allele of A or of B:

| Observe an ‘a’ allele | Observe a ‘b’ allele |
| --- | --- |
| 0 | 0 |
| 0 | 0 |
| 0 | 0 |
| 1 | 0 |
| 0 | 0 |
| 0 | 0 |
| 1 | 1 |
| 0 | 0 |
| 1 | 1 |
| 0 | 0 |
In [ ]:
import pandas as pd

# Binary indicators of the minor allele at the test variant and at the
# neighboring variant for ten individuals
testCol='test'
neighborCol='neighbor'
data=pd.DataFrame({testCol:[0,0,0,1,0,0,1,0,1,0], neighborCol:[0,0,0,0,0,0,1,0,1,0]})
testAllele=data[testCol]
neighborAllele=data[neighborCol]
print(data)
   test  neighbor
0     0         0
1     0         0
2     0         0
3     1         0
4     0         0
5     0         0
6     1         1
7     0         0
8     1         1
9     0         0

Or you could consider a two-by-two table; in our example this would coincide with:

|   | A | a | Total |
| --- | --- | --- | --- |
| B | 7 | 1 | 8 |
| b | 0 | 2 | 2 |

Then more generally you could think of it as:

|   | A | a | Total |
| --- | --- | --- | --- |
| B | $O_{11}$ | $O_{12}$ | $O_{1.}$ |
| b | $O_{21}$ | $O_{22}$ | $O_{2.}$ |
In [ ]:
# Cell counts of the two-by-two table: rows index the neighboring allele (B/b),
# columns the test allele (A/a)
O_11 = data[(data[testCol] == 0) & (data[neighborCol] == 0)].shape[0]
O_12 = data[(data[testCol] == 1) & (data[neighborCol] == 0)].shape[0]
O_21 = data[(data[testCol] == 0) & (data[neighborCol] == 1)].shape[0]
O_22 = data[(data[testCol] == 1) & (data[neighborCol] == 1)].shape[0]
print([O_11, O_12, O_21, O_22])
[7, 1, 0, 2]

We can then consider the maximum likelihood estimates of the coefficients by setting up a logistic model with a single binary covariate, as in our situation: $$\hat{\beta}_{1j}=log\left( \frac{O_{11}O_{22}}{O_{21}O_{12}}\right)=log\left(odds\;ratio\right)$$

$$var\left(\hat{\beta}_{1j}\right)=\frac{1}{O_{11}}+\frac{1}{O_{12}}+\frac{1}{O_{21}}+\frac{1}{O_{22}}$$

The coefficient for the one SNP predicting the other, $\hat{\beta}_{1j}$, is the log odds ratio. As a side note, notice what happens to $\hat{\beta}_{1j}$ if you swap $O_{11}$ and $O_{22}$, or swap the off-diagonal terms; this symmetry will come into play later in terms of how to optimize. However, in our scenarios of sparse data or perfect LD, the estimates of the coefficients and their corresponding variances can explode. Try plugging zero into one or two cells of the two-by-two table (say the case above, or perfect LD, where both off-diagonal values are zero) and notice that this produces zeroes in the denominators of the MLE estimates. In all transparency, logistic regression solvers typically solve for the estimates iteratively rather than explicitly solving this system of equations; however, the instability in the parameter estimates still arises.


Taking an example of $O_{11}=7, O_{12}=1, O_{21}=0, O_{22}=2$:

$$\hat{\beta}_{1j}=log\left( \frac{7\times2}{\color{red}{0}\times1}\right)=log\left(\infty\right)$$

$$var\left(\hat{\beta}_{1j}\right)=\frac{1}{7}+\frac{1}{1}+\frac{1}{\color{red}0}+\frac{1}{2}$$
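
To see this numerically, here is a small sketch (not from the original notebook) that plugs the toy counts into the two formulas above with NumPy; the zero cell drives both the point estimate and its variance to infinity.

In [ ]:
import numpy as np

# Toy two-by-two counts; the zero in O_21 is what breaks the MLE
O_11, O_12, O_21, O_22 = np.array([7.0, 1.0, 0.0, 2.0])

with np.errstate(divide='ignore'):
    beta1_hat = np.log((O_11 * O_22) / (O_21 * O_12))   # log odds ratio
    var_beta1 = 1/O_11 + 1/O_12 + 1/O_21 + 1/O_22       # Wald variance

print(beta1_hat)  # inf -- the log odds ratio estimate is not finite
print(var_beta1)  # inf -- neither is its estimated variance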


A useful set of techniques to overcome problems of separation and sparsity of data are penalization methods, which have both frequentist and Bayesian foundations. Penalty functions drive parameter estimates toward zero by incurring a cost for including parameters in the model, and are typically thought of as being effective in high-dimensional settings. For example, a common penalized regression is the Lasso, which minimizes $\|y-X{\boldsymbol{\beta}}\|_2^2+\lambda\|{\boldsymbol{\beta}}\|_1$: you add another term because you are in a high-dimensional setting where you think many of the coefficients should actually have zero effect size, and introducing a penalty on the coefficients shrinks them toward zero. Greenland and coauthors (Greenland 2003; 2007; Greenland and Mansournia 2015; Mansournia et al. 2018) have done extensive work on penalty functions and equivalent Bayesian techniques, especially in settings with binary outcomes, and on their implications for reducing bias and mean squared error (MSE) under separation and sparsity. Greenland proposes using a class of penalty functions proportional to the log determinant of the information matrix, $r({\boldsymbol{\beta}})=\ln(|I({\boldsymbol{\beta}})|)^m$. The penalized log likelihood can be written in the form:

$$l_p({\boldsymbol{\beta}})=l({\boldsymbol{\beta}})+\color{green}{\sum_k\left\{\frac{m}{2}\beta_k-m\log\left(1+e^{\beta_k}\right)\right\}}\tag{1}$$

The log approximate Bayes factor in this setting, plugging in equation (1), is:

$$\log\left(BF_j\right)=\log\left\{\frac{p(y|M_{1,j})}{p(y|M_{0,j})}\right\}\approx\log\left(ABF_j\right)=\log \left\{\frac{L\left(D|\tilde{\beta_0},\tilde{\beta_1}\right)p\left(\tilde{\beta_0}\right)p\left(\tilde{\beta_1}\right)}{L\left(D|\tilde{\beta_0}\right)p\left(\tilde{\beta_0}\right)}\right\}$$

$$=\left\{l\left(\tilde{\beta_0}, \tilde{\beta_1}\right)+\frac{m}{2}\tilde{\beta_0}-m\log\left(1+e^{\tilde{\beta_0}}\right)+\frac{m}{2}\tilde{\beta_1}-m\log\left(1+e^{\tilde{\beta_1}}\right)\right\}-\left\{l\left(\tilde{\beta_0}\right)+\frac{m}{2}\tilde{\beta_0}-m\log\left(1+e^{\tilde{\beta_0}}\right)\right\}$$

Where $l({\boldsymbol{\beta}})$ is the standard Bernoulli log likelihood for logistic regression and the remaining terms are the penalties, functions of the ${\boldsymbol{\beta}}$ parameters being estimated. This penalized likelihood has been shown to be proportional to the posterior under log F(m,m) priors, so maximizing the penalized log likelihood is equivalent to finding the posterior mode. The log F is in the conjugate family for binomial logistic regression, making it a natural choice in such settings. It is easy to see that m=0 recovers the maximum likelihood estimate (MLE); further, m=1 corresponds to Jeffreys prior in the one-parameter model, and m=1 is used in this setting based on the recommendations of Greenland and Mansournia (Greenland and Mansournia 2015). Established data augmentation techniques then provide a computationally efficient and robust way to obtain estimates of the posterior modes of the coefficients (Canario 2020).
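
As an aside, the following small sketch (not from the original notebook) evaluates the per-coefficient penalty term of equation (1) through a hypothetical helper named logf_penalty; the penalty is largest at $\beta=0$ and decreases symmetrically as $|\beta|$ grows, which is what shrinks the estimates toward zero.

In [ ]:
import numpy as np

def logf_penalty(beta, m=1):
    """Illustrative helper (not from the LD-ABF code): the log F(m,m) penalty
    for one coefficient, (m/2)*beta - m*log(1 + exp(beta))."""
    return m / 2.0 * beta - m * np.log(1.0 + np.exp(beta))

# Largest (least negative) at beta = 0, symmetric, and decreasing as |beta| grows
for b in [-4.0, -1.0, 0.0, 1.0, 4.0]:
    print(b, round(logf_penalty(b), 3))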

In our scenario, say we use a log F(m,m) prior; the model can then be estimated by augmenting our data set:

| Observe an ‘a’ allele | Observe a ‘b’ allele |
| --- | --- |
| 0 | 0 |
| 0 | 0 |
| 0 | 0 |
| 1 | 0 |
| 0 | 0 |
| 0 | 0 |
| 1 | 1 |
| 0 | 0 |
| 1 | 1 |
| 0 | 0 |
| 0 | 0 |
| 0 | 1 |
| 1 | 0 |
| 1 | 1 |

Notice that we no longer have an off-diagonal zero, which results in more stable estimates of the coefficients and their corresponding variances. Fitting a standard logistic regression on the augmented data gives the coefficient estimates with the corresponding penalties of equation (1). Similar to the Cauchy (or t) distribution, the $log\,F$ priors increase tail weight or skew the prior distribution, where the $log\,F$ can provide heavier tails than a multivariate normal but lighter tails than the Cauchy. In terms of the two-by-two table, the augmentation adds a count to each cell:

|   | A | a |
| --- | --- | --- |
| B | 7+1 | 1+1 |
| b | 0+1 | 2+1 |
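
As a rough, unweighted check (for illustration only; the actual fit below weights each pseudo-observation by m/2), recomputing the log odds ratio and its variance on these augmented counts now gives finite values:

In [ ]:
import numpy as np

# Augmented counts: one pseudo-observation added to each cell of the toy table
O_11a, O_12a, O_21a, O_22a = 7 + 1, 1 + 1, 0 + 1, 2 + 1

beta1_aug = np.log((O_11a * O_22a) / (O_21a * O_12a))   # finite log odds ratio (~2.48)
var_aug = 1/O_11a + 1/O_12a + 1/O_21a + 1/O_22a         # finite variance (~1.96)
print(beta1_aug, var_aug)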

To test for association between a test SNP and a neighboring variant, we use the approximate Bayes factor. The Bayes factor (Kass and Raftery 1995) has been used in a variety of settings, including extensive use in GWAS (Wakefield 2007; 2009; M. H. Chen et al. 2020; Maller et al. 2012). The log approximate Bayes factor in this setting, approximating $\log\left\{P(y|M_{1,j})/P(y|M_{0,j})\right\}$ by plugging in equation (1), is: $$\log\left(ABF_j\right)=\left\{l(\beta_0,\beta_1)+\frac{m}{2}\beta_0-m\log(1+e^{\beta_0})+\frac{m}{2}\beta_1-m\log(1+e^{\beta_1})\right\}-\left\{l(\beta_0)+\frac{m}{2}\beta_0-m\log(1+e^{\beta_0})\right\}\tag{2}$$

Where the index $j$ denotes the current neighboring SNP; later we'll expand this to the other SNPs in the window. If we were to set m to zero, we'd be left with a simple log likelihood ratio, $l(\beta_0,\beta_1)-l(\beta_0)$. First, let's fit the intercept-only model, $M_{0,j}= logit^{−1}(\beta_{0j})$.

In [ ]:
import statsmodels.api as sm
#########################################
###### Intercept Only Model (M_0,j)######
#########################################

#Data_X
## Add weight and intercept columns to the original X data
weightCol="weight"
interceptCol= "intercept"
startingTestValues = pd.DataFrame(testAllele)
startingTestValues[interceptCol] = 1
startingTestValues[weightCol]=1
#Data_Y
startingNeighborValues = pd.DataFrame(neighborAllele)
#Add augmented pseudo-observations for AB and Ab to the original X and Y data
augmentX = pd.DataFrame({testCol: [0, 0], interceptCol: [0, 0]})
m=1  #log F(m,m) prior with m=1
##Each augmented record gets weight m/2
augmentX[weightCol] = m / 2.0
augmentedInput = pd.concat([startingTestValues, augmentX], ignore_index=True)
augmentY = pd.DataFrame({neighborCol: [0, 1]})
augmentedOutput = pd.concat([startingNeighborValues, augmentY], ignore_index=True)
print(augmentedInput)
print(augmentedOutput)

#Run logistic regression
logitIntercept_M0 = sm.Logit(augmentedOutput, augmentedInput[interceptCol],
                        freq_weights=augmentedInput[weightCol],
                            disp=int(False)).fit()
    test  intercept  weight
0      0          1     1.0
1      0          1     1.0
2      0          1     1.0
3      1          1     1.0
4      0          1     1.0
5      0          1     1.0
6      1          1     1.0
7      0          1     1.0
8      1          1     1.0
9      0          1     1.0
10     0          0     0.5
11     0          0     0.5
    neighbor
0          0
1          0
2          0
3          0
4          0
5          0
6          1
7          0
8          1
9          0
10         0
11         1
Optimization terminated successfully.
         Current function value: 0.532527
         Iterations 5

Then fit model 1, $M_{1,j}= logit^{−1}(\beta_{0j}+\beta_{1j}x_i)$.

In [ ]:
#######################################################
###### Model with the neighboring variant (M_1,j)######
#######################################################

#Add augmented pseudo-observations for aB and ab to augmentedInput and augmentedOutput
augmentX = pd.DataFrame({testCol: [1, 1], interceptCol: [0, 0]})
augmentX[weightCol]=m/2.0
augmentedInputFull = pd.concat([augmentedInput, augmentX], ignore_index=True)
augmentY = pd.DataFrame({neighborCol:[0,1]})
augmentedOutputFull = pd.concat([augmentedOutput, augmentY], ignore_index=True)
print(augmentedInputFull)
print(augmentedOutputFull)

#Run logistic regression
logitFullModel_M1 = sm.Logit(augmentedOutputFull.astype(float), augmentedInputFull[[interceptCol, testCol]],
                freq_weights=augmentedInputFull[weightCol],disp=int(False)).fit()
    test  intercept  weight
0      0          1     1.0
1      0          1     1.0
2      0          1     1.0
3      1          1     1.0
4      0          1     1.0
5      0          1     1.0
6      1          1     1.0
7      0          1     1.0
8      1          1     1.0
9      0          1     1.0
10     0          0     0.5
11     0          0     0.5
12     1          0     0.5
13     1          0     0.5
    neighbor
0          0
1          0
2          0
3          0
4          0
5          0
6          1
7          0
8          1
9          0
10         0
11         1
12         0
13         1
Optimization terminated successfully.
         Current function value: 0.468765
         Iterations 6

Now we compare the posterior of the intercept-only model $M_{0,j}= logit^{−1}(\beta_{0j})$ versus the model with the neighboring variant, $M_{1,j}= logit^{−1}(\beta_{0j}+\beta_{1j}x_i)$, to get the $log\,ABF_j$.

In [ ]:
import numpy as np
#########################################
###### Intercept Only Model (M_0,j)######
#########################################
beta0inter = logitIntercept_M0.params[interceptCol]
yPred = np.exp(beta0inter) / (1 + np.exp(beta0inter))
#Penalized log likelihood of M_0,j: Bernoulli log likelihood plus the beta_0 penalty from equation (1)
logLikePerObsInter = startingNeighborValues[neighborCol] * np.log(yPred) + (1 - startingNeighborValues[neighborCol]) * np.log(1 - yPred)
logLikelihood_M0 = logLikePerObsInter.sum() + m / 2.0 * beta0inter - m * np.log(1 + np.exp(beta0inter))
#######################################################
###### Model with the neighboring variant (M_1,j)######
#######################################################
beta0 = logitFullModel_M1.params[interceptCol]
beta1 = logitFullModel_M1.params[testCol]
yPred=np.exp(beta1*startingTestValues[testCol] + beta0) / (1+np.exp(beta1*startingTestValues[testCol] + beta0))
#Penalized log likelihood of M_1,j: Bernoulli log likelihood plus the beta_0 and beta_1 penalties
logLikePerObs = (startingNeighborValues[neighborCol]*np.log(yPred) + (1-startingNeighborValues[neighborCol])*np.log(1-yPred)).sum()
logLikelihood_M1 = logLikePerObs + m/2.0*beta0 - m*np.log(1+np.exp(beta0)) \
                        + m/2.0*beta1 - m*np.log(1+np.exp(beta1))
#log F ABF for the pair of test and neighboring variants
logLogfABF_j = logLikelihood_M1 - logLikelihood_M0
print(logLikelihood_M1)
print(logLikelihood_M0)
print(logLogfABF_j)
-5.292883166575112
-5.920314967256034
0.6274318006809221

To get the test statistic across the entire neighboring region, we take the product of these ABFs between the test SNP and each neighboring SNP in a window of one thousand bases (five hundred bases up- and downstream), where the log is taken for computational ease. Monomorphic sites are considered to have an uninformative ABF of 1. The final statistic, plugging in equation (2), is the log of the product of the ABFs across the entire window, divided by the window size purely for interpretability:
$$log\,ABF=\frac{1}{W}log\prod_{j=1}^W \left[\frac{P(y|M_{1,j})}{P(y|M_{0,j})}\right]\tag{3}$$

[Figure 2: LDABF_Fig2.jpg]

So, the same steps as above would be repeated for each of the neighboring SNPs within the given window.
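
As a minimal sketch of equation (3) (not part of the original notebook), suppose we had already computed a log ABF for each polymorphic neighboring site in the window; the window statistic is then just their sum divided by the window size, since monomorphic positions contribute log(1) = 0:

In [ ]:
import numpy as np

# Hypothetical per-neighbor log ABF values for the polymorphic sites in the
# 1,000 bp window around the test SNP (monomorphic sites contribute log(1) = 0)
log_abf_per_site = [0.6274, 0.0, 1.12, 0.31]   # placeholder values, not real output
window_size = 1000                             # W in equation (3): 500 bp up- and downstream

ld_abf = np.sum(log_abf_per_site) / window_size
print(ld_abf)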

References

Canario, Remy. 2020. “The Best Classifier for Small Datasets: Log-F(m,m) Logit.” Medium, 1–15. https://medium.com/@remycanario17/log-f-m-m-logit-the-best-classification-algorithm-for-small-datasets-fc92fd95bc58.

Chen, Ming Huei, Laura M. Raffield, Abdou Mousas, Saori Sakaue, Jennifer E. Huffman, Arden Moscati, Bhavi Trivedi, et al. 2020. “Trans-Ethnic and Ancestry-Specific Blood-Cell Genetics in 746,667 Individuals from 5 Global Populations.” Cell 182. https://doi.org/10.1016/j.cell.2020.06.045.

Greenland, Sander. 2003. “Generalized Conjugate Priors for Bayesian Analysis of Risk and Survival Regressions.” Biometrics 59 (1): 92–99. https://doi.org/10.1111/1541-0420.00011.

Greenland, Sander. 2007. “Bayesian Perspectives for Epidemiological Research. II. Regression Analysis.” International Journal of Epidemiology 36 (1): 195–202. https://doi.org/10.1093/ije/dyl289.

Greenland, Sander, and Mohammad Ali Mansournia. 2015. “Penalization, Bias Reduction, and Default Priors in Logistic and Related Categorical and Survival Regressions.” Statistics in Medicine 34 (23): 3133–43. https://doi.org/10.1002/sim.6537.

Kass, Robert E., and Adrian E. Raftery. 1995. “Bayes Factors.” Journal of the American Statistical Association 90 (430): 773–95. https://doi.org/10.1080/01621459.1995.10476572.

Maller, Julian B., Gilean McVean, Jake Byrnes, Damjan Vukcevic, Kimmo Palin, Zhan Su, Joanna M.M. Howson, et al. 2012. “Bayesian Refinement of Association Signals for 14 Loci in 3 Common Diseases.” Nature Genetics 44 (12): 1294–1301. https://doi.org/10.1038/ng.2435.

Mansournia, Mohammad Ali, Angelika Geroldinger, Sander Greenland, and Georg Heinze. 2018. “Separation in Logistic Regression: Causes, Consequences, and Control.” American Journal of Epidemiology 187 (4): 864–70. https://doi.org/10.1093/aje/kwx299.

Tachmazidou, Ioanna, Konstantinos Hatzikotoulas, Lorraine Southam, Jorge Esparza-gordillo, Valeriia Haberland, Jie Zheng, Toby Johnson, et al. 2019. “Identification of New Therapeutic Targets for Osteoarthritis through Genome-Wide Analyses of UK Biobank Data.” Nature Genetics 51 (February). https://doi.org/10.1038/s41588-018-0327-1.

Wakefield, Jon. 2007. “A Bayesian Measure of the Probability of False Discovery in Genetic Epidemiology Studies.” American Journal of Human Genetics 81 (2): 208–27. https://doi.org/10.1086/519024.

Wakefield, Jon. 2009. “Bayes Factors for Genome-Wide Association Studies: Comparison with P -Values.” Genetic Epidemiology 33 (1): 79–86. https://doi.org/10.1002/gepi.20359.

Wuttke, Matthias, Yong Li, Man Li, Karsten B. Sieber, Mary F. Feitosa, Mathias Gorski, Adrienne Tin, et al. 2019. “A Catalog of Genetic Loci Associated with Kidney Function from Analyses of a Million Individuals.” Nature Genetics 51 (6): 957–72. https://doi.org/10.1038/s41588-019-0407-x.