The LCV (latent causal variable) model is a method that can be used to estimate whether two genetically correlated traits are causally related, as well as the direction of any potential causal relationship. It assumes a single latent variable mediates the genetic correlation between the two traits; how strongly each trait's genetic component is correlated with this latent variable determines the genetic causality proportion (gcp), which ranges from -1 to 1.
I could hypothesise that higher plasma triglyceride levels cause higher systolic blood pressure, and apply the LCV method to two GWAS datasets: one for triglyceride levels, and one for systolic blood pressure. These are the steps that are needed:
The first step involves downloading and standardising two GWAS files. This example uses a GWAS for systolic blood pressure (GCST90029011) and a GWAS for plasma triglyceride levels (GCST000758).
The commands required to perform this step can be found at github.com/alicesmail12/LDSC-LCV/tree/main/TriGsBP. Here is the output from Step 1:
## Heritability of phenotype 1
## ---------------------------
## Total Observed scale h2: 0.1372 (0.0242)
## Lambda GC: 1.0016
## Mean Chi^2: 1.1293
## Intercept: 0.8657 (0.0124)
## Ratio < 0 (usually indicates GC correction).
##
## Heritability of phenotype 2/2
## -----------------------------
## Total Observed scale h2: 0.2044 (0.0081)
## Lambda GC: 2.0007
## Mean Chi^2: 2.8409
## Intercept: 1.1489 (0.0259)
## Ratio: 0.0809 (0.0141)
##
## Genetic Covariance
## ------------------
## Total Observed scale gencov: 0.0184 (0.0036)
## Mean z1*z2: 0.0774
## Intercept: -0.0003 (0.0074)
##
## Genetic Correlation
## -------------------
## Genetic Correlation: 0.11 (0.0205)
## Z-score: 5.3559
## P: 8.5124e-08
From the LDSC analysis, I can see that both traits are heritable (observed-scale h2 estimates of 0.14 and 0.20) and that they are positively genetically correlated (rg = 0.11, p = 8.5e-08). However, I don't actually need any of this information for LCV! Only the standardised GWAS output files are needed.
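For reference, the munged .sumstats.gz files can be previewed directly in R (after setting the working directory as in the next chunk); assuming the standard munge_sumstats.py output, the columns should be SNP, A1, A2, Z and N:
# Preview the first few rows of one munged file (columns assumed to be the
# standard LDSC munge_sumstats output: SNP, A1, A2, Z, N)
read.table("./GWASFiles/TriG.sumstats.gz", header=TRUE, sep="\t", nrows=5)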
Using R, I can load both standardised GWAS outputs from LDSC, as well as the LD score data for each chromosome:
# Set wd
setwd('/Users/alicesmail/Desktop/KCL/LCV/TriGBPs')
# Get standardised data (LDSC output)
P1 = na.omit(read.table("./GWASFiles/TriG.sumstats.gz", header=TRUE, sep="\t", stringsAsFactors = FALSE))
P2 = na.omit(read.table("./GWASFiles/BPs.sumstats.gz", header=TRUE, sep="\t", stringsAsFactors = FALSE))
# Get LD scores
LDScores = read.table("./eur_w_ld_chr/1.l2.ldscore.gz", header=TRUE, sep='\t', stringsAsFactors=FALSE)
for (i in 2:22){
x <- read.table(paste0("./eur_w_ld_chr/",i,".l2.ldscore.gz"), header=TRUE, sep='\t', stringsAsFactors=FALSE)
LDScores <- rbind(LDScores, x)
}
# Merge LD scores with munged GWAS data
P = merge(P1, P2, by="SNP")
data = merge(LDScores, P, by="SNP")
# SNPs included in analysis
dim(data)[1]
## [1] 1134610
In this analysis, there are 1,134,610 SNPs shared between the two GWAS datasets and the LD score files. Before the jackknife, the SNPs need to be sorted by position and the Z-scores aligned so that both GWAS refer to the same effect allele:
# Sort by CHR & BP
data = data[order(data[,"CHR"], data[,"BP"]),]
# Flip sign of one z-score if opposite alleles
mismatch = which(data$A1.x!=data$A1.y,arr.ind=TRUE)
data[mismatch,]$Z.y = data[mismatch,]$Z.y*-1
data[mismatch,]$A1.y = data[mismatch,]$A1.x
data[mismatch,]$A2.y = data[mismatch,]$A2.x
# Assign LD & Z-scores to variables
LDScores <- data$L2
P1Zscores <- data$Z.x
P2Zscores <- data$Z.y
Now that the Z-scores from both GWAS datasets and their corresponding LD scores are ordered across all 1,134,610 SNPs, we can perform a block jackknife. The jackknife uses the moment-estimation functions in MomentFunctions.R (from the LCV repository at github.com/lukejoconnor/LCV), which are sourced first:
source("/Users/alicesmail/Desktop/KCL/LCV/TriGBPs/MomentFunctions.R")
This analysis also needs a few defined variables:
# Number of blocks for the jackknife
nBlocks <- 100
# Set intercepts and weights
crosstrait.intercept <- 1
ldsc.intercept <- 1
intercept.12 <- 0
# Get weights
weights <- 1/pmax(1,LDScores)
Prior to the jackknife, it is important to check that the Z-scores and LD scores align (i.e. that the vectors are the same length):
# Get LD score length
nSNP=length(LDScores)
# Check LD score is same length as Z-scores
if(length(P1Zscores)!=nSNP||length(P2Zscores)!=nSNP){
stop('LD scores and summary statistics should have the same length')
}
Now I can generate a grid of all the possible values of the gcp estimate (from -1 to 1):
# Grid of all gcp values
grid <- (-100:100)/100
# Get size of blocks
size.blocks=floor(nSNP/nBlocks)
# Generate empty 100x8 (nBlocks x 8) matrix to hold the jackknife estimates
jackknife=matrix(0, nBlocks, 8)
# Block size
print(size.blocks)
## [1] 11346
The size of each block of SNPs is 11,346 - roughly 1/100 of the total number of SNPs.
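To make the leave-one-block-out indexing in the loop below easier to follow, here is a toy example (with made-up numbers: 10 SNPs split into 5 blocks of 2):
# Toy example of leave-one-block-out indexing (illustrative only)
toy.nSNP <- 10; toy.nBlocks <- 5
toy.size <- floor(toy.nSNP/toy.nBlocks)
jk <- 2
ind <- c(1:((jk-1)*toy.size), (jk*toy.size+1):toy.nSNP)
ind # 1 2 5 6 7 8 9 10 - block 2 (SNPs 3 and 4) is left out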
I can perform the jackknife procedure to obtain the mixed fourth moments for each block omission:
# For each jackknife, estimate rho, mixed fourth moments and heritability intercepts
for(jk in 1:nBlocks){
# Generate block of SNPs for jackknife
if(jk==1) {ind <- (size.blocks+1):nSNP}
else if(jk==nBlocks) {ind <- 1:(size.blocks*(jk-1))}
else {ind <- c(1:((jk-1)*size.blocks), (jk*size.blocks+1):nSNP)}
# Jackknife estimates
jackknife[jk,] <- EstimateK4(LDScores[ind], P1Zscores[ind], P2Zscores[ind],
crosstrait.intercept, ldsc.intercept, weights[ind],
sig.threshold=.Machine$integer.max, n.1=1, n.2=1, intercept.12, 8)
# Catch errors
if(any(is.nan(jackknife))){stop('NaNs produced.')}
}
Sometimes the jackknife can produce mixed fourth moments that are 'outliers': when a specific block of SNPs is excluded in one iteration, the estimates are markedly different from those of the other iterations. This can be checked by making a scatter plot:
# Check for outlier blocks
jackknifedf <- as.data.frame(jackknife)
jackknifedf$colour <- NA
# Plot
ggplot(jackknifedf, aes(x=jackknifedf[,2], y=jackknifedf[,3], colour=colour))+
geom_point(size=4, alpha=0.5)+
theme+
ylab("Mixed Fourth Moments 1")+ xlab("Mixed Fourth Moments 2")+
guides(colour="none")+
theme(text = element_text(size = 16))
From the scatter plot, we can see that there are no major outliers. If there were, the corresponding jackknife blocks could be removed to check whether they affect the overall results.
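As a minimal sketch of how this could be done (outlier.rows is a hypothetical vector of offending row indices, used purely for illustration):
# Hypothetical example: drop outlier jackknife blocks before re-running the
# downstream calculations (outlier.rows is a made-up placeholder)
outlier.rows <- c(17, 52)                     # assumed outlier blocks, for illustration
jackknife.filtered <- jackknife[-outlier.rows, ]
nBlocks.filtered <- nrow(jackknife.filtered)  # use in place of nBlocks downstream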
It is good to check that the genetic correlation estimate from LCV is somewhat similar to the estimate from LDSC. These values won’t be exactly the same as LCV and LDSC use slightly different methods to estimate genetic correlation, and we are using a smaller set of SNPs.
# Get mean rho for rho estimate (genetic correlation)
rho.est <- mean(jackknife[,1])
# Get rho standard error for rho estimate (genetic correlation)
rho.err <- sd(jackknife[,1])*sqrt(nBlocks+1)
# Record the sign of rho (used to orient the one-sided tests below)
flip <- sign(rho.est)
message('Genetic correlation: ', round(rho.est, 4), ' (', round(rho.err, 4),')')
## Genetic correlation: 0.1239 (0.0233)
The genetic correlation estimate from LCV is 0.12, which is very similar to the LDSC estimate of 0.11.
Now that we have our mixed fourth moments, we can compute the likelihood of each specific gcp value. The code below, from github.com/lukejoconnor/LCV, does this:
# Convert the raw mixed fourth moments to excess cokurtosis by subtracting
# 3*rho (their expected value for bivariate normal effect sizes)
jackknife[,2:3] <- jackknife[,2:3]-3*jackknife[,1]
# Vector holding the likelihood of each possible gcp value
gcp.likelihood.trigbps <- grid
gcp.likelihood.trigbps[] <- 0
# For each value in the gcp grid get the likelihood
for(kk in 1:length(grid)){
# Compute likelihood
xx <- grid[kk]
fx <- abs(jackknife[,1])^(-xx)
numer <- jackknife[,2]/fx-fx*jackknife[,3]
denom <- pmax(1/abs(jackknife[,1]), sqrt(jackknife[,2]^2/fx^2+ jackknife[,3]^2*fx^2))
pct.diff <- numer/denom
gcp.likelihood.trigbps[kk] <- dt(mean(pct.diff)/sd(pct.diff)/sqrt(nBlocks+1), nBlocks-2)
# Get p-value when gcp=1 and gcp=-1
if(xx==-1){pval.fullycausal.2 <- pt(-flip*mean(pct.diff)/sd(pct.diff)/sqrt(nBlocks+1),nBlocks-2)}
if(xx==1){pval.fullycausal.1 <- pt(flip*mean(pct.diff)/sd(pct.diff)/sqrt(nBlocks+1),nBlocks-2)}
# Get zscore when gcp=0 (tests whether gcp is different from 0)
if(xx==0){zscore <- flip*mean(pct.diff)/sd(pct.diff)/sqrt(nBlocks+1)}
}
To inspect the computed likelihood estimates for each gcp value, we can create a heatmap as follows:
# gcp likelihood dataframe
gcp.likelihood.trigbps.df <- data.frame(gcp.likelihood.trigbps)
gcp.likelihood.trigbps.df$gcp.value <- (-100:100)/100
# Plot
ggplot(gcp.likelihood.trigbps.df, aes(x = gcp.value, y = 0.5, fill = gcp.likelihood.trigbps))+
geom_tile(color="black", size=1)+
geom_tile(fill='white')+
geom_tile(alpha=0.75)+
ylab("")+
theme+
scale_fill_gradientn(colours=colorRampPalette(c('#5f8fb0', '#93b572', '#e7d044', '#d10000'))(100),
na.value='#5f8fb0',
guide=guide_colorbar(frame.colour="black",
ticks.colour="black",
alpha=0.75,
title='Likelihood'))+
scale_y_discrete(expand=(c(0,0.8)))
From the heatmap, we can see that it is much more likely that the gcp value is near 1 than near -1. To get a single gcp estimate, we take the mean of all possible gcp values, weighted by their computed likelihoods:
# Estimate gcp value: Get mean of all gcp values (weighted by likelihood)
gcp.pm.trigbps <- WeightedMean(grid, gcp.likelihood.trigbps)
# Estimate gcp value: estimate standard error
gcp.pse.trigbps <- sqrt(WeightedMean(grid^2, gcp.likelihood.trigbps)-gcp.pm.trigbps^2)
# Print
message('gcp estimate: ', round(gcp.pm.trigbps, 2), ' (', round(gcp.pse.trigbps, 2),')')
## gcp estimate: 0.79 (0.17)
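WeightedMean is defined in MomentFunctions.R; assuming it computes a standard likelihood-weighted average, it is roughly equivalent to:
# Assumed behaviour of WeightedMean - a sketch, not necessarily the
# repository's exact implementation
WeightedMean <- function(x, weights) sum(x * weights) / sum(weights)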
A gcp estimate of 0.79 indicates a direction of causality from plasma triglyceride level to systolic blood pressure. In other words, the LCV model predicts that systolic blood pressure is affected by plasma triglyceride level - not the other way around.
How is the gcp value derived? As a simplified overview, we can compare the two cokurtosis estimates: one where plasma triglyceride level is the primary trait, and one where systolic blood pressure is the primary trait. Roughly, the ratio between the two indicates which direction the gcp estimate will go in (positive or negative):
# Create df
df <- data.frame(comparison=c('TriGsBP', 'sBPTriG'),
cokurtosis=c(mean(jackknife[,2]), mean(jackknife[,3])),
sd=c(sd(jackknife[,2]), sd(jackknife[,3])))
# Plot
ggplot(df, aes(x=comparison, y=cokurtosis))+
geom_col(fill=c('#5f8fb0', '#93b572'), colour='black', position = position_dodge(width = 0.5))+
theme +
xlab('CoKurtosis Direction') + ylab('CoKurtosis') +
geom_errorbar(aes(ymin = cokurtosis-sd,ymax = cokurtosis+sd), width = 0.2)
Here it can be seen that the cokurtosis estimate where plasma triglyceride level is the primary variable is much larger than the cokurtosis estimate where systolic blood pressure is the primary variable. A large difference between these two cokurtosis measures results in a gcp estimate of large magnitude (near -1 or 1), so the ratio between the two estimates gives us an idea of how our gcp estimate was derived.
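As a quick numerical check of this (an illustrative addition, not part of the original LCV output), the ratio of the two mean cokurtosis estimates can be printed directly:
# Ratio of the two mean cokurtosis estimates; a ratio far from 1 corresponds
# to a gcp estimate of large magnitude (illustrative check only)
mean(jackknife[,2]) / mean(jackknife[,3])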
After computing a gcp estimate, we can also perform a test on the null hypothesis that gcp=0:
# Estimate p-value (tests whether gcp is different from 0)
pval.gcpzero.2tailed.trigbps=pt(-abs(zscore),nBlocks-1)*2
# Print
message('P-value: ', formatC(pval.gcpzero.2tailed.trigbps, format = "e", digits = 2))
## P-value: 4.68e-47
A p-value of 4.68e-47 is highly significant, meaning that we can reject the null hypothesis that gcp=0.
It is also a good idea to check the heritability z-scores for both of the GWAS datasets we used: a z-score below 7 may indicate a trait that is not heritable enough for LCV to work correctly.
# Get mean z-scores for each GWAS
mean.zscore.1.trig <- mean(jackknife[,5])
mean.zscore.2.bps <- mean(jackknife[,6])
# Get h2 z-score
h2.zscore.1.trig <- mean(jackknife[,5])/sd(jackknife[,5])/sqrt(nBlocks+1)
h2.zscore.2.bps <- mean(jackknife[,6])/sd(jackknife[,6])/sqrt(nBlocks+1)
# Create df
h2.zscores <- data.frame('GWAS'=c('Triglycerides', 'sBP'), 'h2 Z-score'=c(h2.zscore.1.trig, h2.zscore.2.bps))
# Plot
ggplot(h2.zscores, aes(x=GWAS))+
geom_col(aes(y=h2.Z.score), fill=c('#5f8fb0', '#93b572'), alpha=0.75, colour='black')+
geom_hline(yintercept=7, colour='black', linetype='dashed')+
geom_hline(yintercept=4, colour='black', linetype='solid')+
scale_y_continuous(limits=c(0,40), expand=c(0,0), name=bquote(h^2~`Z-score`))+
theme
The graph indicates that both plasma triglyceride level and systolic blood pressure are heritable enough for the LCV model to work reliably.
By applying the LCV model to two different GWAS datasets for plasma triglyceride level and systolic blood pressure, we can conclude that plasma triglyceride level is likely to affect systolic blood pressure.