Genetic Correlation

Estimating Genetic Correlation for BMI & T2D

An example hypothesis could be that the strong association between body mass index (BMI) with type 2 diabetes (T2D) might mean that these two traits have some level of genetic correlation. This can be tested by performing LD score regression using the LDSC package, which estimates the heritability of single traits, as well as the genetic correlation between two traits.

LDSC software is employed through the command line, or via a bash script. The first step of this workflow that I have written in a bash file downloads some relevant datasets from GWAS catalog:

# Define working directory
WORKING=/UsingLDSC/BMI_T2D/

# Download GWAS summary statistics: T2D and BMI
wget http://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/GCST90029001-GCST90030000/GCST90029024/GCST90029024_buildGRCh37.tsv -P $WORKING
wget http://ftp.ebi.ac.uk//pub/databases/gwas/summary_statistics/GCST90428001-GCST90429000/GCST90428119/GCST90428119.tsv.gz -P $WORKING

Removing the MHC Region

Before using LDSC, a prior step can be performed that removes the MHC region: the complex LD patterns in this region can affect heritability & genetic correlation estimates. Using the script RemoveMHC.py (at github.com/alicesmail12/LDSC) I can remove the MHC region from both of the downloaded files:

# Remove MHC region
$WORKING/RemoveMHC.py \
--sumstatsFile $WORKING/GWASFiles/GCST90428119.tsv \
--sep t \
--logFile $WORKING/Logs/BMIMHC \
--outFile $WORKING/GWASFiles/BMIMHC            

$WORKING/RemoveMHC.py \
--sumstatsFile $WORKING/GWASFiles/GCST90029024_buildGRCh37.tsv \
--sep t \
--logFile $WORKING/Logs/T2DMHClog \
--outFile $WORKING/GWASFiles/T2DMHC

The output logs from these commands should be inspected to check nothing went wrong:

cat /Users/alicesmail/Desktop/KCL/LCV/Logs/BMIMHCrm.log


## *********************************************************************
## Removing MHC region for GCST90428119.tsv
## *********************************************************************
## 
## Reading GWAS file
##    18895667 SNPs read
## GWAS file head:
##    chromosome  base_pair_location effect_allele  ...  ci_lower  ci_upper       n
## 0           9            10772003             A  ...  0.040816  0.393132  342566
## 1           9            10772182             C  ... -0.006017  0.007986  342566
## 2           9            10772208             A  ... -0.008168  0.003294  342566
## 3           9            10772286             T  ...  0.043258  0.393247  342566
## 4           9            10772288             C  ... -0.006007  0.007995  342566
## 
## [5 rows x 13 columns]
## Standardising GWAS colnames
## Removing SNPs with no rsID
##    885937 SNPs with no rsID
## Identifying genome build
##    Genome Build: GRCh37
## Removing SNPs in MHC region
##    67623 SNPs removed
## Saving file
## *********************************************************************

Preparing the LDSC Software

Now that the MHC region is removed from both GWAS files, we can set up an environment to perform LDSC. First I load Python 2.7, as this version works with LDSC:

# Within working dir activate python 2.7 (required for LDSC)
cd $WORKING
CONDA_SUBDIR=osx-64 conda create -n py27 python=2.7                               
conda activate py27
conda config --env --set subdir osx-64

Then I can clone the LDSC software from the LDSC repository (github.com/bulik/ldsc):

# Clone LSDC software into working dir
cd $WORKING
git clone https://github.com/bulik/ldsc.git

Once LDSC is cloned, I can activate it and check it is working:

# Define software space
SOFTWARE==/UsingLDSC/BMI_T2D/ldsc

# Activate LDSC
cd ldsc   
conda env create --file environment.yml
conda activate ldsc

# Check LDSC is working
./munge_sumstats.py -h

Munging GWAS Files

If the help information for munge_sumstats.py is printed to the terminal, LDSC has been successfully activated and it can be used on the prepared GWAS files. The first step is ‘munging’ both files, which standardises the format of each file and performs a psuedo-QC using a list of SNPs from HapMap3 (w_hm3.snplist). There are lots of flags you can specify to munge_sumstats.py - the ones I have specified below include:

–sumstats: the GWAS file location
–N: the GWAS sample size
–snp: the name of the SNP column (that contains rsIDs)
–merge-alleles: HapMap3 SNP file location
–out: output file location

# Munge T2D GWAS
$SOFTWARE/munge_sumstats.py \
--sumstats $WORKING/GWASFiles/T2DMHC.tsv \
--N 468298 \
--snp SNP \
--merge-alleles $WORKING/GWASFiles/w_hm3.snplist \
--out $WORKING/T2D

# Munge BMI GWAS
$SOFTWARE/munge_sumstats.py \
--sumstats $WORKING/GWASFiles/BMIMHC.tsv \
--N 342566 \
--snp SNP \
--merge-alleles $WORKING/GWASFiles/w_hm3.snplist \
--out $WORKING/BMI

To check the files were munged without any issues, I can check the log:

cat $WORKING/BMI.log

## *********************************************************************
## * LD Score Regression (LDSC)
## * Version 1.0.1
## * (C) 2014-2019 Brendan Bulik-Sullivan and Hilary Finucane
## * Broad Institute of MIT and Harvard / MIT Department of Mathematics
## * GNU General Public License v3
## *********************************************************************
## Call: 
## ./munge_sumstats.py \
## --out /Users/alicesmail/Desktop/KCL/LCV/BMI \
## --merge-alleles /Users/alicesmail/Desktop/KCL/LCV/w_hm3.snplist \
## --N 342566.0 \
## --snp rs_id \
## --sumstats /Users/alicesmail/Desktop/KCL/LCV/GCST90428119.tsv.gz 
## 
## Interpreting column names as follows:
## p_value: p-Value
## rs_id:   Variant ID (e.g., rs number)
## other_allele:    Allele 2, interpreted as non-ref allele for signed sumstat.
## n:   Sample size
## beta:    [linear/logistic] regression coefficient (0 --> no effect; above 0 --> A1 is trait/risk increasing)
## effect_allele:   Allele 1, interpreted as ref allele for signed sumstat.
## 
## Reading list of SNPs for allele merge from w_hm3.snplist
## Read 1217311 SNPs for allele merge.
## Reading sumstats from GCST90428119.tsv.gz into memory 5000000 SNPs at a time.
## Read 18895667 SNPs from --sumstats file.
## Removed 16799176 SNPs not in --merge-alleles.
## Removed 885937 SNPs with missing values.
## Removed 0 SNPs with INFO <= 0.9.
## Removed 0 SNPs with MAF <= 0.01.
## Removed 0 SNPs with out-of-bounds p-values.
## Removed 360901 variants that were not SNPs or were strand-ambiguous.
## 849653 SNPs remain.
## Removed 1904 SNPs with duplicated rs numbers (847749 SNPs remain).
## Removed 0 SNPs with N < 228377.333333 (847749 SNPs remain).
## Median value of beta was 1.96177e-05, which seems sensible.
## Removed 1649 SNPs whose alleles did not match --merge-alleles (846100 SNPs remain).
## Writing summary statistics for 1217311 SNPs (846100 with nonmissing beta) to BMI.sumstats.gz.
## 
## Metadata:
## Mean chi^2 = 2.6
## Lambda GC = 2.05
## Max chi^2 = 971.656
## 5312 Genome-wide significant SNPs (some may have been removed by filtering).
## 
## Conversion finished at Mon Jun 10 16:54:59 2024
## Total time elapsed: 49.88s

Lots of information is contained in the log output for BMI. For example:

There are 5312 SNPs significantly associated with BMI.
16,799,176 SNPs were removed because they are not in the HapMap3 SNP list.
885,937 SNPs were removed because they had missing values.
The median beta value was near 0 (1.96e-05), which is expected.
The output file contains 846,100 SNPs with beta values.

Performing Genetic Correlation Analysis

If the data has been correctly munged, I can also use LDSC to perform genetic correlation analysis:

# Get genetic correlation between BMI & T2D
$SOFTWARE/ldsc.py \
--rg $WORKING/GWASFiles/BMI.sumstats.gz,$WORKING/GWASFiles/T2D.sumstats.gz \
--ref-ld-chr $WORKING/eur_w_ld_chr/ \
--w-ld-chr $WORKING/eur_w_ld_chr/ \
--out $WORKING/BMI_T2D

Here I can use some new flags:

–rg: specifies the two munged GWAS files
–ref-ld-chr & –w-ld-chr specify LD score files for separate chromosomes (eur_w_ld_chr)

The log output for this function gives contains our results:

cat $WORKING/BMI_T2D.log

## *********************************************************************
## * LD Score Regression (LDSC)
## * Version 1.0.1
## * (C) 2014-2019 Brendan Bulik-Sullivan and Hilary Finucane
## * Broad Institute of MIT and Harvard / MIT Department of Mathematics
## * GNU General Public License v3
## *********************************************************************
## Call: 
## ./ldsc.py \
## --ref-ld-chr eur_w_ld_chr/ \
## --out BMI_T2D \
## --rg BMI.sumstats.gz,T2D.sumstats.gz \
## --w-ld-chr eur_w_ld_chr/ 
## 
## Beginning analysis at Mon Jun 10 16:59:35 2024
## Reading summary statistics from BMI.sumstats.gz ...
## Read summary statistics for 846100 SNPs.
## Reading reference panel LD Score from eur_w_ld_chr/[1-22] ... (ldscore_fromlist)
## Read reference panel LD Scores for 1290028 SNPs.
## Removing partitioned LD Scores with zero variance.
## Reading regression weight LD Score from eur_w_ld_chr/[1-22] ... (ldscore_fromlist)
## Read regression weight LD Scores for 1290028 SNPs.
## After merging with reference panel LD, 822013 SNPs remain.
## After merging with regression SNP LD, 822013 SNPs remain.
## Computing rg for phenotype 2/2
## Reading summary statistics from T2D.sumstats.gz ...
## Read summary statistics for 1217311 SNPs.
## After merging with summary statistics, 822013 SNPs remain.
## 816475 SNPs with valid alleles.
## 
## Heritability of phenotype 1
## ---------------------------
## Total Observed scale h2: 0.2451 (0.0088)
## Lambda GC: 2.0855
## Mean Chi^2: 2.63
## Intercept: 1.0519 (0.017)
## Ratio: 0.0319 (0.0104)
## 
## Heritability of phenotype 2/2
## -----------------------------
## Total Observed scale h2: 0.0442 (0.0031)
## Lambda GC: 1.3101
## Mean Chi^2: 1.4121
## Intercept: 1.0391 (0.0132)
## Ratio: 0.095 (0.0321)
## 
## Genetic Covariance
## ------------------
## Total Observed scale gencov: 0.0583 (0.0034)
## Mean z1*z2: 0.5733
## Intercept: 0.1488 (0.0092)
## 
## Genetic Correlation
## -------------------
## Genetic Correlation: 0.5601 (0.035)
## Z-score: 15.9808
## P: 1.7394e-57
## 
## Summary of Genetic Correlation Results
##               p1               p2      rg     se        z           p  h2_obs  h2_obs_se  h2_int  h2_int_se  gcov_int  gcov_int_se
##  BMI.sumstats.gz  T2D.sumstats.gz  0.5601  0.035  15.9808  1.7394e-57  0.0442     0.0031  1.0391     0.0132    0.1488       0.0092
## Analysis finished at Mon Jun 10 16:59:44 2024
## Total time elapsed: 8.88s

From the log output I can see:

816475 SNPs were used for the genetic correlation analysis.
The heritability of BMI is 0.25, while the heritability of T2D is 0.04.
The genetic correlation of BMI & T2D is 0.56, and is highly significant (p=1.74e-57).

Estimating Genetic Correlation for Tea Drinking & Height

I can repeat this workflow for two traits that might not be correlated, in order to see how the results differ:

#!/bin/bash

# Define working spaces
WORKING=/UsingLDSC/Tea_Height/

# Download GWAS summary statistics: Tea and Height
wget http://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/GCST90096001-GCST90097000/GCST90096926/GCST90096926_buildGRCh37.tsv.gz -P $WORKING/GWASFiles
wget http://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/GCST90267001-GCST90268000/GCST90267284/GCST90267284_buildGRCh37.tsv -P $WORKING/GWASFiles

# Remove MHC region
$WORKING/RemoveMHC.py \
--sumstatsFile $WORKING/GWASFiles/GCST90096926_buildGRCh37.tsv \
--sep t \
--logFile $WORKING/Logs/TeaMHClog \
--outFile $WORKING/GWASFiles/TeaMHC            

$WORKING/RemoveMHC.py \
--sumstatsFile $WORKING/GWASFiles/GCST90267284_buildGRCh37.tsv \
--sep t \
--logFile $WORKING/Logs/HeightMHClog \
--outFile $WORKING/GWASFiles/HeightMHC

# Within working dir activate python 2.7 (required for LDSC)
cd $WORKING
CONDA_SUBDIR=osx-64 conda create -n py27 python=2.7                               
conda activate py27
conda config --env --set subdir osx-64         

# Define software space
SOFTWARE=/UsingLDSC/Tea_Height/ldsc

# Activate LDSC
cd ldsc   
conda env create --file environment.yml
conda activate ldsc

# Check LDSC is working
./munge_sumstats.py -h

# Munge T2D GWAS
$SOFTWARE/munge_sumstats.py \
--out $WORKING/TeaMHC \
--merge-alleles $WORKING/GWASFiles/w_hm3.snplist \
--N 434171 \
--snp SNP \
--sumstats $WORKING/GWASFiles/TeaMHC.tsv

# Munge BMI GWAS
$SOFTWARE/munge_sumstats.py \
--out $WORKING/HeightMHC \
--merge-alleles $WORKING/GWASFiles/w_hm3.snplist \
--N 283749 \
--snp SNP \
--sumstats $WORKING/GWASFiles/HeightMHC.tsv

# Get genetic correlation between BMI & T2D
$SOFTWARE/ldsc.py \
--ref-ld-chr $WORKING/eur_w_ld_chr/ \
--out $WORKING/Tea_Height \
--rg $WORKING/GWASFiles/TeaMHC.sumstats.gz,$WORKING/GWASFiles/HeightMHC.sumstats.gz \
--w-ld-chr $WORKING/eur_w_ld_chr/

I can inspect the log output:

cat $WORKING/Tea_Height.log

## *********************************************************************
## * LD Score Regression (LDSC)
## * Version 1.0.1
## * (C) 2014-2019 Brendan Bulik-Sullivan and Hilary Finucane
## * Broad Institute of MIT and Harvard / MIT Department of Mathematics
## * GNU General Public License v3
## *********************************************************************
## Call: 
## ./ldsc.py \
## --ref-ld-chr /Users/alicesmail/Desktop/KCL/LCV/eur_w_ld_chr/ \
## --out /Users/alicesmail/Desktop/KCL/LCV/Tea_Park \
## --rg /Users/alicesmail/Desktop/KCL/LCV/GWASFiles/TeaMHC.sumstats.gz,/Users/alicesmail/Desktop/KCL/LCV/GWASFiles/HeightMHC.sumstats.gz \
## --w-ld-chr /Users/alicesmail/Desktop/KCL/LCV/eur_w_ld_chr/ 
## 
## Beginning analysis at Tue Jun 25 11:06:09 2024
## Reading summary statistics from TeaMHC.sumstats.gz ...
## Read summary statistics for 1153397 SNPs.
## Reading reference panel LD Score from eur_w_ld_chr/[1-22] ... (ldscore_fromlist)
## Read reference panel LD Scores for 1290028 SNPs.
## Removing partitioned LD Scores with zero variance.
## Reading regression weight LD Score from eur_w_ld_chr/[1-22] ... (ldscore_fromlist)
## Read regression weight LD Scores for 1290028 SNPs.
## After merging with reference panel LD, 1141281 SNPs remain.
## After merging with regression SNP LD, 1141281 SNPs remain.
## Computing rg for phenotype 2/2
## Reading summary statistics from HeightMHC.sumstats.gz ...
## Read summary statistics for 1217311 SNPs.
## After merging with summary statistics, 1141281 SNPs remain.
## 1021233 SNPs with valid alleles.
## 
## Heritability of phenotype 1
## ---------------------------
## Total Observed scale h2: 0.0491 (0.0026)
## Lambda GC: 1.3685
## Mean Chi^2: 1.4605
## Intercept: 1.0239 (0.0108)
## Ratio: 0.052 (0.0234)
## 
## Heritability of phenotype 2/2
## -----------------------------
## Total Observed scale h2: 0.4476 (0.0247)
## Lambda GC: 2.2826
## Mean Chi^2: 3.8319
## Intercept: 1.2305 (0.0524)
## Ratio: 0.0814 (0.0185)
## 
## Genetic Covariance
## ------------------
## Total Observed scale gencov: 0.0044 (0.0028)
## Mean z1*z2: 0.0423
## Intercept: -0.0116 (0.0089)
## 
## Genetic Correlation
## -------------------
## Genetic Correlation: 0.0298 (0.0191)
## Z-score: 1.5604
## P: 0.1187
## 
## Summary of Genetic Correlation Results
##  p1                   p2      rg      se       z       p  h2_obs  h2_obs_se  h2_int  h2_int_se  gcov_int  gcov_int_se
##  TeaMHC.sumstats.gz  HeightMHC.sumstats.gz  0.0298  0.0191  1.5604  0.1187  0.4476     0.0247  1.2305     0.0524   -0.0116       0.0089
## 
## Analysis finished at Tue Jun 25 11:06:18 2024
## Total time elapsed: 9.27s

Here we can see that:

1021233 SNPs were used for the genetic correlation analysis.
The heritability of tea consumption is 0.05, while the heritability of height is 0.45.
The genetic correlation between tea consumption & height is 0.03, which is not significant (p=0.12).

Conclusion

Here I have performed genetic correlation analysis on two pairs of traits using the LDSC package. While one pair of traits (T2D & BMI) are highly genetically correlated, tea-drinking and height show no genetic correlation.