Localized Intolerance Model using Bayesian Regression (LIMBR) is a sub-regional (domains or exons) genic intolerance score. We fit a Bayesian hierarchical model explicitly characterizing depletion in functional variation at both the gene and sub-regional level. Figure 1 from our manuscript1 depicts the approximate geometric interpretation of this approach, the model was regressed on missense variation versus all variation.
The data on this website corresponds to our updated fit of the model on genome Aggregation Database (gnomAD) version v2.1 using 125,748 whole exome sequences. Two sets of scores are calculated by fitting different definitions of sub-regions across genes, once with genic sub-regions defined by exon boundaries and then again with sub-regions defined by functional domains, both using the Conserved Domain Database (CDD).2,3 The filtered data from gnomAD first had to go through ‘PASS’ criteria, or in this version we also allowed for SEGDUP variants to be included, then further restricted to regions with at least 10x coverage in at least 70% of the samples. Additionally, any genes without any variation were excluded from the analysis.
The number of missense variants within the ith sub-region of gene j, yij, as a function of the total number of variants within the ith sub-region of gene j, xij, through the following regression model,
yij = μ+αj + αi(j) + β1xij + e0ij,
We model αj and αi(j) as random effects within a Bayesian framework. We assume standard normal priors for αj and αi(j) but allow for separate prior variances for αi(j) for each gene j. We begin by assuming an inverse gamma prior for the variances. Specifically, we choose αj~N(0,σ2 ) with hyper-parameters σ2~InvGamma( ϵ,ϵ) and ϵ~Uniform(δ,c), where δ is a small positive constant and c is a large positive constant to induce a diffuse prior. For the sub-region parameters, we use a similar structure but with a separate variance for each gene, i.e., we choose αi(j)~N(0,σj2 ) with hyper-parameters σj2~InvGamma( ϵj,ϵj) and ϵj~Uniform(δ,c). Note that by allowing for a gene-level variance, the αijs can be shrunken back to the gene level intolerance when there are no large differences between sub-region or when data is sparse. This will decrease the variability of the αi(j)s, leading to more stable intolerance estimation.
To improve the ergodicity, αi(j) was set to zero for genes with 2 or fewer sub-regions, this is effectively just collapsing genes with only 2 sub-regions eliminating any inflated within versus across chain variance. Further, it is known that the hierarchical model can be augmented, sometimes referred to as noncentral4–6 or ancillary augmentation7, and similar methods are known to improve performance.8,9 So, we introduced an additional hyper parameter v~N(0,1) for α*ij=vσj to reduce autocorrelation.
yij = μ+αj + α*i(j) + β1xij + e0ij,
By introducing an auxiliary hyper parameter, the conditional variance structure is maintained while decoupling the random variables we wish to make inference on, Var(αij│σj2 )=Var( α*i(j)│v,σj2 ). The final score that is used for the analysis is the posterior mode of the combined genic and sub-region terms. The hierarchical model allows information to be shared across sub-regions, stabilizing intolerance estimates. A burn in of 1,000 with an additional 10,000 steps across 5 chains was run for both domains and exons.
For more details please see our article: Improved pathogenic variant localization via a hierarchical model of sub-regional intolerance
If you find the information or resources from this website useful in your research please cite our paper.
Hayeck, T. J., Stong, N., Wolock, C. J., Copeland, B., Kamalakaran, S., Goldstein, D. B., & Allen, A. S. (2019). Improved pathogenic variant localization via a hierarchical model of sub-regional intolerance. The American Journal of Human Genetics, 104(2), 299-309.