Linkage Disequilibrium

What is Linkage Disequilibrium?

Alleles that are located near each other in the genome are likely to be inherited together, while alleles that are positioned far apart are less likely to be inherited together. This phenomenon arises from recombination during meiosis: the closer two alleles are on a chromosome, the less likely it is that a crossover will occur between them and separate them.





This non-random association of alleles, which is strongest between alleles positioned near each other, is known as linkage disequilibrium (LD). It can be quantified by computing the squared correlation (r²) between the alleles at two SNPs.

Haplotypes

Haplotypes are groups of alleles on the same chromosome that are in high LD, and they can span hundreds of kbp. Haplotype blocks are often inherited together. One way to identify haplotypes is to compute the pairwise D or r² scores for a block of alleles: usually, alleles are considered to be in LD if they have an r² > 0.8.





r²: How is it calculated?

Suppose we have two alleles, A & B, at two different loci, with population frequencies pA and pB, and we define the population frequency of the AB haplotype as pAB. If A and B are in linkage equilibrium, meaning they are inherited independently of each other, then we would expect pA x pB = pAB. However, if pA x pB ≠ pAB, then the alleles are in LD.

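We can use these values to estimate the extent of LD between two alleles. The deviation from independence is D = pAB - pA x pB, and r² rescales it by the allele frequencies: r² = D² / (pA x (1 - pA) x pB x (1 - pB)).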
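As a rough illustration, the calculation can be sketched in a few lines of R. This is only a minimal sketch using the standard definitions of D and r² for two biallelic loci. The function name ld_stats() and the example frequencies are made up for illustration and are not part of any package.

```r
# Compute D and r2 for two biallelic loci.
# p_a, p_b: population frequencies of alleles A and B
# p_ab:     population frequency of the AB haplotype
ld_stats <- function(p_a, p_b, p_ab) {
  # r2 is undefined if either allele is fixed or absent
  stopifnot(p_a > 0, p_a < 1, p_b > 0, p_b < 1)
  d  <- p_ab - p_a * p_b                            # D = pAB - pA x pB
  r2 <- d^2 / (p_a * (1 - p_a) * p_b * (1 - p_b))   # r2 = D^2 / (pA(1 - pA)pB(1 - pB))
  list(D = d, r2 = r2)
}

# Hypothetical frequencies, chosen only for illustration
equilibrium <- ld_stats(p_a = 0.5, p_b = 0.5, p_ab = 0.25)  # pAB equals pA x pB
perfect_ld  <- ld_stats(p_a = 0.5, p_b = 0.5, p_ab = 0.5)   # A is always found with B
partial_ld  <- ld_stats(p_a = 0.6, p_b = 0.3, p_ab = 0.25)  # somewhere in between

print(equilibrium)
print(perfect_ld)
print(partial_ld)
```

In the equilibrium case D and r² are both 0, while in the perfect LD case r² is 1. Real tools such as LDlink estimate these frequencies from phased reference panels, so their values depend on the population chosen.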

Using LDlinkR

A useful package for obtaining r² values is LDlinkR. Using the LDmatrix() function, you can query a set of SNPs by their rsIDs. Strictly speaking, an rsID identifies a SNP rather than an allele: it specifies the location of the variant, not which allele is present there (e.g. A or T). For our purposes that is enough, since we only need to know the location to get the r² value.

In the example below, I have 10 SNPs located in the chr9:10770000-10790000 region. We can use LDlinkR to find out whether any of these SNPs are in LD with each other.

library(LDlinkR)
library(tidyverse)
# Define SNPs
SNPs <- c('rs16926435', 'rs16926436', 'rs16926439', 'rs16926451', 'rs140827421', 
          'rs7048649', 'rs6474593', 'rs6474594', 'rs7860440', 'rs7047976')

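# Query LDlink for the pairwise r2 matrix in European (EUR) populations
# Replace 'TOKEN' with your personal LDlink API token
SNPsLD <- LDlinkR::LDmatrix(snps = SNPs,
                            pop = "EUR",
                            r2d = "r2",
                            token = 'TOKEN')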

Now we can inspect the LD across these 10 SNPs by pivoting the LD matrix using pivot_longer() and visualising the LD between SNP pairs using geom_tile():

# Set rownames as rsID to create matrix
rownames(SNPsLD) <- SNPsLD$RS_number
SNPsLD <- SNPsLD[-1]

# Convert to tibble and pivot longer
SNPsLDtbl <- SNPsLD %>% 
  rownames_to_column('X') %>% 
  as_tibble() %>%
  pivot_longer(-X, names_to = "Y", values_to = "value") %>%
  mutate(X = factor(X),
         Y = factor(Y))

# Heatmap of pairwise r2 values
# Note: the 'Radio Canada Big' font must be installed; otherwise drop the family arguments
ggplot(SNPsLDtbl, aes(x = X, y = Y)) +
  theme_bw() +
  geom_tile(aes(fill = value), width = 1, height = 1, colour = 'black', linewidth = 0.5) +
  geom_text(aes(label = round(value, 3)), size = 5, family = 'Radio Canada Big') +
  # r2 values below 0.95 fall outside the limits and are drawn in the na.value colour
  scale_fill_gradient(low = "#FAE36F", high = "#FF7540", limits = c(0.95, 1), na.value = "#FAE36F") +
  theme(legend.position = 'none',
        legend.title = element_blank(),
        text = element_text(size = 16, family = 'Radio Canada Big'),
        panel.grid = element_blank(),
        panel.border = element_blank(),
        panel.background = element_blank(),
        axis.ticks = element_blank(),
        axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
  xlab(NULL) + ylab(NULL)




From this heatmap we can see two groups of SNPs in very high LD: one group of 4 SNPs and a separate group of 5 SNPs. In fact, some SNP pairs, like rs16926436 & rs16926435, have an r² of 1, meaning that in the EUR reference population the pair of alleles (like A & T or C & G) at these two locations is always found together, and so is virtually always inherited together.
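Reading groups off a heatmap gets harder as the number of SNPs grows, so it can help to list the high-LD pairs directly. Below is a minimal sketch in base R, assuming a symmetric r² matrix with SNP IDs as row and column names. The toy matrix, its values and the helper name high_ld_pairs() are all hypothetical and are not LDlink output.

```r
# Toy symmetric r2 matrix with hypothetical values (not real LDlink output)
snp_ids <- c("snp1", "snp2", "snp3", "snp4")
r2_mat <- matrix(c(1.00, 0.98, 0.10, 0.05,
                   0.98, 1.00, 0.12, 0.07,
                   0.10, 0.12, 1.00, 0.91,
                   0.05, 0.07, 0.91, 1.00),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(snp_ids, snp_ids))

# Return every unordered SNP pair whose r2 exceeds the threshold
high_ld_pairs <- function(r2, threshold = 0.8) {
  # Use the upper triangle only: each pair is reported once and the diagonal is skipped
  idx <- which(upper.tri(r2) & r2 > threshold, arr.ind = TRUE)
  data.frame(snp_a = rownames(r2)[idx[, "row"]],
             snp_b = colnames(r2)[idx[, "col"]],
             r2    = r2[idx],
             stringsAsFactors = FALSE)
}

ld_pairs <- high_ld_pairs(r2_mat)
print(ld_pairs)
```

The same helper could probably be applied to the LDlinkR result with high_ld_pairs(as.matrix(SNPsLD)), once the rsIDs have been set as row names as shown earlier. The 0.8 default simply follows the rule of thumb mentioned in the Haplotypes section.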

References

Stumpf & McVean (2003). Estimating recombination rates from population-genetic data. Nature Reviews. Genetics, 4(12), 959–968.

Bush & Moore (2012). Chapter 11: Genome-wide association studies. PLoS Computational Biology, 8(12), e1002822.

Myers, Chanock & Machiela (2020). LDlinkR: An R Package for Rapidly Calculating Linkage Disequilibrium Statistics in Diverse Populations. Frontiers in Genetics, 11(157).

Heidi Marika Hautakangas’s thesis (2018): https://helda.helsinki.fi/server/api/core/bitstreams/d1f408d5-7b4a-4f6a-84e8-6eba25b83467/content