The LCV (latent causal variable) model is a method that can be used to estimate whether two genetically correlated traits are causally related, as well as the direction of any potential causal relationship. It assumes a single latent variable mediates the genetic correlation between the two traits; how strongly each trait's genetic component is correlated with this latent variable determines the genetic causality proportion (gcp), which ranges from -1 to 1.
I could hypothesise that higher plasma triglyceride levels cause higher systolic blood pressure, and apply the LCV method to two GWAS datasets: one for triglyceride levels, and one for systolic blood pressure. These are the steps that are needed:
The first step involves downloading and standardising two GWAS files. This example uses a GWAS for systolic blood pressure (GCST90029011) and a GWAS for plasma triglyceride levels (GCST000758).
The commands required to perform this step can be found at github.com/alicesmail12/LDSC-LCV/tree/main/TriGsBP. Here is the output from Step 1:
## Heritability of phenotype 1
## ---------------------------
## Total Observed scale h2: 0.1372 (0.0242)
## Lambda GC: 1.0016
## Mean Chi^2: 1.1293
## Intercept: 0.8657 (0.0124)
## Ratio < 0 (usually indicates GC correction).
##
## Heritability of phenotype 2/2
## -----------------------------
## Total Observed scale h2: 0.2044 (0.0081)
## Lambda GC: 2.0007
## Mean Chi^2: 2.8409
## Intercept: 1.1489 (0.0259)
## Ratio: 0.0809 (0.0141)
##
## Genetic Covariance
## ------------------
## Total Observed scale gencov: 0.0184 (0.0036)
## Mean z1*z2: 0.0774
## Intercept: -0.0003 (0.0074)
##
## Genetic Correlation
## -------------------
## Genetic Correlation: 0.11 (0.0205)
## Z-score: 5.3559
## P: 8.5124e-08
From the LDSC analysis, I can see that both traits are heritable (observed-scale h2 estimates of 0.14 and 0.20) and that they are positively genetically correlated (rg = 0.11, p = 8.5e-08). However, I don't actually need any of this information for LCV! Only the standardised GWAS output files are needed.
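For reference, the munged .sumstats.gz files can be previewed directly in R (after setting the working directory as in the next chunk); assuming the standard munge_sumstats.py output, the columns should be SNP, A1, A2, Z and N:
# Preview the first few rows of one munged file (columns assumed to be the
# standard LDSC munge_sumstats output: SNP, A1, A2, Z, N)
read.table("./GWASFiles/TriG.sumstats.gz", header=TRUE, sep="\t", nrows=5)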
Using R, I can load both standardised GWAS outputs from LDSC, as well as the LD score data for each chromosome:
# Set wd
setwd('/Users/alicesmail/Desktop/KCL/LCV/TriGBPs')
# Get standardised data (LDSC output)
P1 = na.omit(read.table("./GWASFiles/TriG.sumstats.gz", header=TRUE, sep="\t", stringsAsFactors = FALSE))
P2 = na.omit(read.table("./GWASFiles/BPs.sumstats.gz", header=TRUE, sep="\t", stringsAsFactors = FALSE))
# Get LD scores
LDScores = read.table("./eur_w_ld_chr/1.l2.ldscore.gz", header=TRUE, sep='\t', stringsAsFactors=FALSE)
for (i in 2:22){
x <- read.table(paste0("./eur_w_ld_chr/",i,".l2.ldscore.gz"), header=TRUE, sep='\t', stringsAsFactors=FALSE)
LDScores <- rbind(LDScores, x)
}
# Merge LD scores with munged GWAS data
P = merge(P1, P2, by="SNP")
data = merge(LDScores, P, by="SNP")
# SNPs included in analysis
dim(data)[1]
## [1] 1134610
In this analysis, there are 1,134,610 SNPs shared between the two GWAS datasets and the LD score files. Before the jackknife, the SNPs need to be sorted by position and the Z-scores aligned so that both GWAS refer to the same effect allele:
# Sort by CHR & BP
data = data[order(data[,"CHR"], data[,"BP"]),]
# Flip sign of one z-score if opposite alleles
mismatch = which(data$A1.x!=data$A1.y,arr.ind=TRUE)
data[mismatch,]$Z.y = data[mismatch,]$Z.y*-1
data[mismatch,]$A1.y = data[mismatch,]$A1.x
data[mismatch,]$A2.y = data[mismatch,]$A2.x
# Assign LD & Z-scores to variables
LDScores <- data$L2
P1Zscores <- data$Z.x
P2Zscores <- data$Z.y
Now that the Z-scores from both GWAS datasets and their corresponding LD scores are ordered across all 1,134,610 SNPs, we can perform a block jackknife. The jackknife uses the moment-estimation functions in MomentFunctions.R (from the LCV repository at github.com/lukejoconnor/LCV), which are sourced first:
source("/Users/alicesmail/Desktop/KCL/LCV/TriGBPs/MomentFunctions.R")
This analysis also needs a few defined variables:
# Number of blocks for the jackknife
nBlocks <- 100
# Set intercepts and weights
crosstrait.intercept <- 1
ldsc.intercept <- 1
intercept.12 <- 0
# Get weights
weights <- 1/pmax(1,LDScores)
Prior to the jackknife, it is important to check that the Z-scores and LD scores align (i.e. that the vectors are the same length):
# Get LD score length
nSNP=length(LDScores)
# Check LD score is same length as Z-scores
if(length(P1Zscores)!=nSNP||length(P2Zscores)!=nSNP){
stop('LD scores and summary statistics should have the same length')
}
Now I can generate a grid of all the possible values of the gcp estimate (from -1 to 1):
# Grid of all gcp values
grid <- (-100:100)/100
# Get size of blocks
size.blocks=floor(nSNP/nBlocks)
# Generate empty 100x8 (nBlocks x 8) matrix to hold the jackknife estimates
jackknife=matrix(0, nBlocks, 8)
# Block size
print(size.blocks)
## [1] 11346
The size of each block of SNPs is 11,346 - roughly 1/100 of the total number of SNPs.
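To make the leave-one-block-out indexing in the loop below easier to follow, here is a toy example (with made-up numbers: 10 SNPs split into 5 blocks of 2):
# Toy example of leave-one-block-out indexing (illustrative only)
toy.nSNP <- 10; toy.nBlocks <- 5
toy.size <- floor(toy.nSNP/toy.nBlocks)
jk <- 2
ind <- c(1:((jk-1)*toy.size), (jk*toy.size+1):toy.nSNP)
ind # 1 2 5 6 7 8 9 10 - block 2 (SNPs 3 and 4) is left out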
I can perform the jackknife procedure to obtain the mixed fourth moments for each block omission:
# For each jackknife, estimate rho, mixed fourth moments and heritability intercepts
for(jk in 1:nBlocks){
# Generate block of SNPs for jackknife
if(jk==1) {ind <- (size.blocks+1):nSNP}
else if(jk==nBlocks) {ind <- 1:(size.blocks*(jk-1))}
else {ind <- c(1:((jk-1)*size.blocks), (jk*size.blocks+1):nSNP)}
# Jackknife estimates
jackknife[jk,] <- EstimateK4(LDScores[ind], P1Zscores[ind], P2Zscores[ind],
crosstrait.intercept, ldsc.intercept, weights[ind],
sig.threshold=.Machine$integer.max, n.1=1, n.2=1, intercept.12, 8)
# Catch errors
if(any(is.nan(jackknife))){stop('NaNs produced.')}
}
Sometimes the jackknife can produce mixed fourth moments that are 'outliers': when a specific block of SNPs is excluded in one iteration, the estimates are markedly different from those of the other iterations. This can be checked by making a scatter plot:
# Check for outlier blocks
jackknifedf <- as.data.frame(jackknife)
jackknifedf$colour <- NA
# Plot
ggplot(jackknifedf, aes(x=jackknifedf[,2], y=jackknifedf[,3], colour=colour))+
geom_point(size=4, alpha=0.5)+
theme+
ylab("Mixed Fourth Moments 1")+ xlab("Mixed Fourth Moments 2")+
guides(colour="none")+
theme(text = element_text(size = 16))
From the scatter plot, we can see that there are no major outliers. If there were, the corresponding jackknife blocks could be removed to check whether they affect the overall results.
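As a minimal sketch of how this could be done (outlier.rows is a hypothetical vector of offending row indices, used purely for illustration):
# Hypothetical example: drop outlier jackknife blocks before re-running the
# downstream calculations (outlier.rows is a made-up placeholder)
outlier.rows <- c(17, 52)                     # assumed outlier blocks, for illustration
jackknife.filtered <- jackknife[-outlier.rows, ]
nBlocks.filtered <- nrow(jackknife.filtered)  # use in place of nBlocks downstream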
It is good to check that the genetic correlation estimate from LCV is somewhat similar to the estimate from LDSC. These values won’t be exactly the same as LCV and LDSC use slightly different methods to estimate genetic correlation, and we are using a smaller set of SNPs.
# Get mean rho for rho estimate (genetic correlation)
rho.est <- mean(jackknife[,1])
# Get rho standard error for rho estimate (genetic correlation)
rho.err <- sd(jackknife[,1])*sqrt(nBlocks+1)
# Record the sign of rho (used to orient the one-sided tests below)
flip <- sign(rho.est)
message('Genetic correlation: ', round(rho.est, 4), ' (', round(rho.err, 4),')')
## Genetic correlation: 0.1239 (0.0233)
The genetic correlation estimate from LCV is 0.12, which is very similar to the LDSC estimate of 0.11.
Now that we have our mixed fourth moments, we can compute the likelihood of each specific gcp value. The code below, from github.com/lukejoconnor/LCV, does this:
# Convert the raw mixed fourth moments to excess cokurtosis by subtracting
# 3*rho (their expected value for bivariate normal effect sizes)
jackknife[,2:3] <- jackknife[,2:3]-3*jackknife[,1]
# Vector holding the likelihood of each possible gcp value
gcp.likelihood.trigbps <- grid
gcp.likelihood.trigbps[] <- 0
# For each value in the gcp grid get the likelihood
for(kk in 1:length(grid)){
# Compute likelihood
xx <- grid[kk]
fx <- abs(jackknife[,1])^(-xx)
numer <- jackknife[,2]/fx-fx*jackknife[,3]
denom <- pmax(1/abs(jackknife[,1]), sqrt(jackknife[,2]^2/fx^2+ jackknife[,3]^2*fx^2))
pct.diff <- numer/denom
gcp.likelihood.trigbps[kk] <- dt(mean(pct.diff)/sd(pct.diff)/sqrt(nBlocks+1), nBlocks-2)
# Get p-value when gcp=1 and gcp=-1
if(xx==-1){pval.fullycausal.2 <- pt(-flip*mean(pct.diff)/sd(pct.diff)/sqrt(nBlocks+1),nBlocks-2)}
if(xx==1){pval.fullycausal.1 <- pt(flip*mean(pct.diff)/sd(pct.diff)/sqrt(nBlocks+1),nBlocks-2)}
# Get zscore when gcp=0 (tests whether gcp is different from 0)
if(xx==0){zscore <- flip*mean(pct.diff)/sd(pct.diff)/sqrt(nBlocks+1)}
}
To inspect the computed likelihood estimates for each gcp value, we can create a heatmap as follows:
# gcp likelihood dataframe
gcp.likelihood.trigbps.df <- data.frame(gcp.likelihood.trigbps)
gcp.likelihood.trigbps.df$gcp.value <- (-100:100)/100
# Plot
ggplot(gcp.likelihood.trigbps.df, aes(x = gcp.value, y = 0.5, fill = gcp.likelihood.trigbps))+
geom_tile(color="black", size=1)+
geom_tile(fill='white')+
geom_tile(alpha=0.75)+
ylab("")+
theme+
scale_fill_gradientn(colours=colorRampPalette(c('#5f8fb0', '#93b572', '#e7d044', '#d10000'))(100),
na.value='#5f8fb0',
guide=guide_colorbar(frame.colour="black",
ticks.colour="black",
alpha=0.75,
title='Likelihood'))+
scale_y_discrete(expand=(c(0,0.8)))
From the heatmap, we can see that it is much more likely that the gcp value is near 1 than near -1. To get a single gcp estimate, we take the mean of all possible gcp values, weighted by their computed likelihoods:
# Estimate gcp value: Get mean of all gcp values (weighted by likelihood)
gcp.pm.trigbps <- WeightedMean(grid, gcp.likelihood.trigbps)
# Estimate gcp value: estimate standard error
gcp.pse.trigbps <- sqrt(WeightedMean(grid^2, gcp.likelihood.trigbps)-gcp.pm.trigbps^2)
# Print
message('gcp estimate: ', round(gcp.pm.trigbps, 2), ' (', round(gcp.pse.trigbps, 2),')')
## gcp estimate: 0.79 (0.17)
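WeightedMean is defined in MomentFunctions.R; assuming it computes a standard likelihood-weighted average, it is roughly equivalent to:
# Assumed behaviour of WeightedMean - a sketch, not necessarily the
# repository's exact implementation
WeightedMean <- function(x, weights) sum(x * weights) / sum(weights)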
A gcp estimate of 0.79 indicates a direction of causality from plasma triglyceride level to systolic blood pressure. In other words, the LCV model predicts that systolic blood pressure is affected by plasma triglyceride level - not the other way around.
How is the gcp value derived? As a simplified overview, we can compare the two cokurtosis estimates: one where plasma triglyceride level is the primary trait, and one where systolic blood pressure is the primary trait. Roughly, the ratio between the two indicates which direction the gcp estimate will go in (positive or negative):
# Create df
df <- data.frame(comparison=c('TriGsBP', 'sBPTriG'),
cokurtosis=c(mean(jackknife[,2]), mean(jackknife[,3])),
sd=c(sd(jackknife[,2]), sd(jackknife[,3])))
# Plot
ggplot(df, aes(x=comparison, y=cokurtosis))+
geom_col(fill=c('#5f8fb0', '#93b572'), colour='black', position = position_dodge(width = 0.5))+
theme +
xlab('CoKurtosis Direction') + ylab('CoKurtosis') +
geom_errorbar(aes(ymin = cokurtosis-sd,ymax = cokurtosis+sd), width = 0.2)
Here it can be seen that the cokurtosis estimate where plasma triglyceride level is the primary variable is much larger than the cokurtosis estimate where systolic blood pressure is the primary variable. A large difference between these two cokurtosis measures results in a gcp estimate of large magnitude (near -1 or 1), so the ratio between the two estimates gives us an idea of how our gcp estimate was derived.
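As a quick numerical check of this (an illustrative addition, not part of the original LCV output), the ratio of the two mean cokurtosis estimates can be printed directly:
# Ratio of the two mean cokurtosis estimates; a ratio far from 1 corresponds
# to a gcp estimate of large magnitude (illustrative check only)
mean(jackknife[,2]) / mean(jackknife[,3])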
After computing a gcp estimate, we can also perform a test on the null hypothesis that gcp=0:
# Estimate p-value (tests whether gcp is different from 0)
pval.gcpzero.2tailed.trigbps=pt(-abs(zscore),nBlocks-1)*2
# Print
message('P-value: ', formatC(pval.gcpzero.2tailed.trigbps, format = "e", digits = 2))
## P-value: 4.68e-47
A p-value of 4.68e-47 is highly significant, meaning that we can reject the null hypothesis that gcp=0.
It is also a good idea to check the heritability z-scores for both of the GWAS datasets we used: a z-score below 7 may indicate a trait that is not heritable enough for LCV to work correctly.
# Get mean z-scores for each GWAS
mean.zscore.1.trig <- mean(jackknife[,5])
mean.zscore.2.bps <- mean(jackknife[,6])
# Get h2 z-score
h2.zscore.1.trig <- mean(jackknife[,5])/sd(jackknife[,5])/sqrt(nBlocks+1)
h2.zscore.2.bps <- mean(jackknife[,6])/sd(jackknife[,6])/sqrt(nBlocks+1)
# Create df
h2.zscores <- data.frame('GWAS'=c('Triglycerides', 'sBP'), 'h2 Z-score'=c(h2.zscore.1.trig, h2.zscore.2.bps))
# Plot
ggplot(h2.zscores, aes(x=GWAS))+
geom_col(aes(y=h2.Z.score), fill=c('#5f8fb0', '#93b572'), alpha=0.75, colour='black')+
geom_hline(yintercept=7, colour='black', linetype='dashed')+
geom_hline(yintercept=4, colour='black', linetype='solid')+
scale_y_continuous(limits=c(0,40), expand=c(0,0), name=bquote(h^2~`Z-score`))+
theme
The graph indicates that both plasma triglyceride level and systolic blood pressure are heritable enough for the LCV model to work reliably.
By applying the LCV model to two different GWAS datasets for plasma triglyceride level and systolic blood pressure, we can conclude that plasma triglyceride level is likely to affect systolic blood pressure.