Mendelian Randomisation

References

MR: Richmond & Smith (2022) Mendelian Randomisation: Concepts and Scope. Cold Spring Harbor Perspectives in Medicine, 12(1), a040501.

What is MR?

Mendelian Randomisation (MR) is a method that uses instrumental variables derived from genetic association data for a single trait, to test whether there is a causal effect of one trait on a seperate trait. The most common application of MR is using two seperate GWAS datasets for two different traits - a method known as two-sample MR.

MR Rationale

Mendelian inheritance concerns the random allocation of genetic variants to offspring independently of environmental factors and other genetic variables (with the exception of variants that are inherited in linkage disequlibrium).

Because this random allocation occurs at the population level, we can use genetic variants to identify groups of interest.
Using genetic variants as a proxy to identify a group of people that are likely to share a trait means we can avoid selecting people in a way that might introduce social or behavioural confounders.

How MR Works

MR is performed using a set of genetic instruments that are strongly associated with one trait - ‘Trait 1’. To find out if Trait 1 affects another trait - ‘Trait 2’ - MR can be applied to see if variants strongly associated with Trait 1 are proportionally associated with Trait 2.

MR Assumptions

There are three main assumptions that MR relies on:

While not all of these assumptions can be proven or disproven, they can be managed, and there are certain tests that can be used to assess the level of MR assumption violation:

Genetic variants that are strongly associated with Trait 1 can be derived using a GW significance level of 1e-8. The strength of the chosen variants can also be assessed by calculating an F-statistic (strong IVs are indicated by an F-stat>20).
Confounding can be managed by using GWAS datasets from homogeneous populations.
Assessing the contribution to heterogeneity of each variant can help reveal ‘outliers’: variants that disproportionately affect one trait, which may indicate pleiotropy. Horizontal pleiotropy can also be assessed using the MR-Egger test:

Triangulation of Evidence

It is important to note that even if a casual effect is identified by MR, this finding should be replicated and validated using different datasets and different methods. For example, if a causal effect was identified between a molecular trait and a disease outcome, RNAseq or proteomic analysis of tissues from patients with the disease could aid validation of an association between increased/decreased molecular levels and the disease.

MR can be performed on any two traits that have been analysed using a GWAS. The following example looks at whether plasma triglyceride level might have a causal effect on systolic blood pressure.

1. Packages

There are two useful R packages for performing MR - TwoSampleMR and RadialMR - which need to be loaded:

library(RadialMR)
library(TwoSampleMR)

2. Loading Data

First both GWAS datasets are loaded into R, and any information that is missing is added:

# Set wd
setwd(('/Users/alicesmail/Desktop/KCL/LCV/TriGBPs'))

# Get GWAS data
options(datatable.fread.datatable=FALSE)
TriG <- data.table::fread("./GWASFiles/TG_with_Effect.tbl", header=TRUE, sep="\t", stringsAsFactors = FALSE)
sBP <- data.table::fread("./GWASFiles/GCST90029011_buildGRCh37.tsv", header=TRUE, sep="\t", stringsAsFactors = FALSE)

# Add phenotype information
TriG$Phenotype <- 'TriG'
TriG$samplesize <- 96598
sBP$Phenotype <- 'sBP'

3. Filtering Exposure Data

Next the exposure GWAS is filtered to obtain SNPs that are strongly associated with plasma triglyceride level:

# Get instrumental variables for TriG
TriGSig <- TriG %>% filter(GC.Pvalue <= 1e-8)

The TwoSampleMR function format_data() formats the data for MR:

# Format exposure data
TriGMR <- TwoSampleMR::format_data(snps=NULL, type='exposure', TriGSig, header=TRUE,
                                   phenotype_col='Phenotype', snp_col='MarkerName',
                                   beta_col='Effect', se_col='StdErr',
                                   effect_allele_col='Allele1', other_allele_col='Allele2',
                                   pval_col='GC.Pvalue', 
                                   samplesize='samplesize')

4. Clumping

Clumping is then performed on the filtered and formatted data: this is the removal of SNPs in high LD so that the most significant SNP per region is retained. This is a very important step: SNPs in high linkage disequilibrium with significant SNPs will introduce overcounting and will bias the regression. For MR, only independent SNPs should be kept. The MRC IEU API has a function that performs clumping on the set of SNPs:

# LD clumping
options(ieugwasr_api='gwas-api.mrcieu.ac.uk/')
TriGMRClumped <- clump_data(TriGMR, clump_kb=500, clump_r2=0.01)

5. Removing the MHC Region

Another step to take here is removing any MHC SNPs, due to the complex LD patterns in this region. The exposure GWAS does not contain information on chromosome & position, so they can be filtered out of the outcome data instead:

# Get SNPs in region
MHC <- sBP %>%
  filter(chromosome==6) %>%
  filter(base_pair_location >= 28510120 & base_pair_location <= 33480577)
MHCSNP <- MHC %>% dplyr::select(variant_id)

# Remove
sBPNoMHC <- sBP %>% filter(!variant_id %in% unlist(MHCSNP))

Next use TwoSampleMR to format the outcome GWAS:

# Format outcome data
sBPMR <- TwoSampleMR::format_data(snps=TriGMRClumped$SNP, type='outcome', sBPNoMHC, header=TRUE,
                                   phenotype_col='Phenotype', snp_col='variant_id',
                                   beta_col='beta', se_col='standard_error',
                                   effect_allele_col='effect_allele', other_allele_col='other_allele',
                                   pval_col='p_value', 
                                   samplesize='n')

6. Harmonisation

harmonise_data() harmonises the two datasets: this makes sure the alleles across both datasets are aligned correctly.

# Harmonise
harm <- harmonise_data(exposure_dat=TriGMRClumped, outcome_dat=sBPMR)

7. MR

There are several different methods that can be used to perform MR.

The main MR method is the inverse variance weighted (IVW) method, which is a regression weighted by the outcome SE, with the intercept constrained to 0.
The MR-Egger method is similar, but the intercept is unconstrained: this can give a good indication of whether pleiotropy is present.
Median and mode based methods perform MR on a subset of the SNPs, meaning they are more robust to any MR violations.
In general, the IVW method is the main MR result, while other methods can be used to check that the effect identified by the IVW method is somewhat consistent under different parameters.

# MR
res <- mr(harm)
knitr::kable(res)

id.exposure	id.outcome	outcome	exposure	method	nsnp	b	se	pval
1QhhMh	YcFEc9	sBP	TriG	MR Egger	23	-0.0138922	0.0385138	0.7219214
1QhhMh	YcFEc9	sBP	TriG	Weighted median	23	0.0301564	0.0145819	0.0386332
1QhhMh	YcFEc9	sBP	TriG	Inverse variance weighted	23	0.0308218	0.0208680	0.1396787
1QhhMh	YcFEc9	sBP	TriG	Simple mode	23	0.0038519	0.0290564	0.8957403
1QhhMh	YcFEc9	sBP	TriG	Weighted mode	23	0.0531585	0.0147576	0.0015834

The results from the IVW method suggest that there is not a causal relationship between plasma triglyceride level and systolic blood pressure, with a beta effect that is not significantly different from 0.

8. Sensitivity Analyses

After performing MR, different sensitivity methods can be implemented to assess heterogeneity and pleiotropy.

# Heterogeneity
het <- mr_heterogeneity(harm)

# Single SNP
singleSNP <- mr_singlesnp(harm)
I2 <- Isq(singleSNP$b, singleSNP$se)

# Pleiotropy
pleio <- mr_pleiotropy_test(harm)

# Print
knitr::kable(het)

id.exposure	id.outcome	outcome	exposure	method	Q	Q_df	Q_pval
1QhhMh	YcFEc9	sBP	TriG	MR Egger	150.1872	21	0
1QhhMh	YcFEc9	sBP	TriG	Inverse variance weighted	163.6190	22	0

knitr::kable(pleio)

id.exposure	id.outcome	outcome	exposure	egger_intercept	se	pval
1QhhMh	YcFEc9	sBP	TriG	0.0036629	0.0026728	0.1850226

message('I2: ', round(I2[1], 4))

## I2: 0.8545

These sensitivity analyses show that this MR analysis has substantial heterogeneity - the Q value is very large and significant, and an I2 of 86% is very high. On the other hand, the MR-Egger intercept test is not suggestive of pleiotropy.

A mean F-statistic for the instrumental variables can assess their strength:

# Individual f-statistics
dat <- harm %>% filter(mr_keep=TRUE)
total_F <- 0
num <- 0
for (row in 1:nrow(dat)){
  SNP = dat[row,]
  Fstat <- (SNP$beta.exposure**2)/(SNP$se.exposure**2)
  total_F <- total_F+Fstat
  num <- num+1
}

# Mean f-statistic
FstatMean <- total_F/nrow(dat)
message(round(FstatMean,3))

## 146.025

9. Plots

As well as sensitivity analyses, there are some plots that can be used to inspect the MR results.

# Scatter plot
scatter <- mr_scatter_plot(res, harm)
scatter[[1]] + theme + guides(color=guide_legend(ncol=1))

This scatter plot is a good way to examine the different MR regression methods. The IVW and MR-Egger results have different directions of effect, and roughly half of the SNPs have opposing effects on plasma triglyceride level and systolic blood pressure.

Another useful plot is the leave-one-out plot, which can help identify any SNPs that have a large effect on the result.

# LOO plot
LOO <- mr_leaveoneout(harm, method=mr_ivw_fe)
mr_leaveoneout_plot(LOO)[[1]] + theme + theme(legend.position="none")

10. Outliers

Something else to do is examine if there are any outliers in the analysis - outliers are SNPs that substantially contribute to the overall heterogeneity. RadialMR can identify any outliers:

# Identify outliers
radial <- harm %>% filter(mr_keep==TRUE) %>% dat_to_RadialMR
radialIVW <- ivw_radial(radial[[1]], alpha=0.05/nrow(radial[[1]]))

## 
## Radial IVW
## 
##                    Estimate   Std.Error  t value     Pr(>|t|)
## Effect (Mod.2nd) 0.03090395 0.020873880 1.480508 1.387377e-01
## Iterative        0.03090438 0.020873910 1.480527 1.387328e-01
## Exact (FE)       0.03290962 0.007680747 4.284689 1.829946e-05
## Exact (RE)       0.03110524 0.018924598 1.643641 1.144669e-01
## 
## 
## Residual standard error: 2.719 on 22 degrees of freedom
## 
## F-statistic: 2.19 on 1 and 22 DF, p-value: 0.153
## Q-Statistic for heterogeneity: 162.6372 on 22 DF , p-value: 1.914582e-23
## 
##  Outliers detected 
## Number of iterations = 3

# Radial plot
plot_radial(radialIVW, radial_scale=FALSE, show_outliers=TRUE) + theme

Then we can remove the outliers we identified and see how our results are affected.

# Remove outliers
outliers <- radialIVW$outliers
TriGMRClumpedrmOutliers <- TriGMRClumped %>% filter(!SNP %in% outliers$SNP)

# Re-harmonise data
harm <- harmonise_data(exposure_dat=TriGMRClumpedrmOutliers, outcome_dat=sBPMR)

# MR
res <- mr(harm)
knitr::kable(res)

id.exposure	id.outcome	outcome	exposure	method	nsnp	b	se	pval
1QhhMh	YcFEc9	sBP	TriG	MR Egger	19	0.0383116	0.0280671	0.1900468
1QhhMh	YcFEc9	sBP	TriG	Weighted median	19	0.0348414	0.0147743	0.0183620
1QhhMh	YcFEc9	sBP	TriG	Inverse variance weighted	19	0.0250642	0.0142645	0.0789006
1QhhMh	YcFEc9	sBP	TriG	Simple mode	19	0.0027901	0.0368017	0.9404040
1QhhMh	YcFEc9	sBP	TriG	Weighted mode	19	0.0599157	0.0161682	0.0016178

As these outliers are identified by their contribution to the overall heterogeneity of the analysis, when they are removed the overall heterogeneity should be reduced:

# Heterogeneity
het <- mr_heterogeneity(harm)

# Single SNP
singleSNP <- mr_singlesnp(harm)
I2 <- Isq(singleSNP$b, singleSNP$se)

# Print
knitr::kable(het)

id.exposure	id.outcome	outcome	exposure	method	Q	Q_df	Q_pval
1QhhMh	YcFEc9	sBP	TriG	MR Egger	56.27195	17	4.3e-06
1QhhMh	YcFEc9	sBP	TriG	Inverse variance weighted	57.28027	18	5.6e-06

message('I2: ', round(I2[1], 4))

## I2: 0.6521

While the heterogeneity is not as extreme as before, it is still substantial, meaning the results from this analysis may be biased.

Results

In conclusion, while the IVW results from this analysis are borderline significant (p=0.08), there isn't enough evidence to say that plasma triglyceride levels causally effect systolic blood pressure. Additionally, even after removing outliers there is a substantial amount of heterogeneity in the effects of each SNP on each trait, which makes these results less reliable.

Some positives of this analysis are that the instrumental variables are sufficiently strong (F-stat=146), and that the exposure trait (plasma triglyceride levels) could be sufficiently modelled in MR by several strongly associated SNPs.

MR & LCV

The LCV model is a similar method to MR that aims to identify causal effects between two traits using GWAS data. The results from applying the LCV model to the same GWAS datasets are here.

Why do the results differ?

As the LCV model takes into account all SNPs from the input GWAS, it may be better powered to detect causal effects: MR only uses selected variants associated with exposure trait. It may be the case that this MR analysis was underpowered, relative to applying LCV, to detect a casual effect.