The relationship between the product of the z-scores from two studies
of traits with some genetic correlation is similar to that of a single
polygenic trait: for a polygenic trait,
Where the slope estimates genetic covariance per SNP (instead of heritability):
From this slope estimate, the genetic correlation value can be calculated using the equation :
Essentially, similar z-scores with a large magnitude for the same SNP in two different GWASes, will result in a larger the z-score product and a steeper regression line:
An example hypothesis could be that the strong association between
LDSC software is employed through the command line, or via a bash script. The first step of this workflow that I have written in a bash file downloads some relevant datasets from GWAS catalog:
# Define working directory
WORKING=/UsingLDSC/BMI_T2D/
# Download GWAS summary statistics: T2D and BMI
wget http://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/GCST90029001-GCST90030000/GCST90029024/GCST90029024_buildGRCh37.tsv -P $WORKING
wget http://ftp.ebi.ac.uk//pub/databases/gwas/summary_statistics/GCST90428001-GCST90429000/GCST90428119/GCST90428119.tsv.gz -P $WORKING
Before using LDSC, a prior step can be performed that removes the
# Remove MHC region
$WORKING/RemoveMHC.py \
--sumstatsFile $WORKING/GWASFiles/GCST90428119.tsv \
--sep t \
--logFile $WORKING/Logs/BMIMHC \
--outFile $WORKING/GWASFiles/BMIMHC
$WORKING/RemoveMHC.py \
--sumstatsFile $WORKING/GWASFiles/GCST90029024_buildGRCh37.tsv \
--sep t \
--logFile $WORKING/Logs/T2DMHClog \
--outFile $WORKING/GWASFiles/T2DMHC
The output logs from these commands should be inspected to check nothing went wrong:
cat /Users/alicesmail/Desktop/KCL/LCV/Logs/BMIMHCrm.log
## *********************************************************************
## Removing MHC region for GCST90428119.tsv
## *********************************************************************
##
## Reading GWAS file
## 18895667 SNPs read
## GWAS file head:
## chromosome base_pair_location effect_allele ... ci_lower ci_upper n
## 0 9 10772003 A ... 0.040816 0.393132 342566
## 1 9 10772182 C ... -0.006017 0.007986 342566
## 2 9 10772208 A ... -0.008168 0.003294 342566
## 3 9 10772286 T ... 0.043258 0.393247 342566
## 4 9 10772288 C ... -0.006007 0.007995 342566
##
## [5 rows x 13 columns]
## Standardising GWAS colnames
## Removing SNPs with no rsID
## 885937 SNPs with no rsID
## Identifying genome build
## Genome Build: GRCh37
## Removing SNPs in MHC region
## 67623 SNPs removed
## Saving file
## *********************************************************************
Now that the MHC region is removed from both GWAS files, we can set up an environment to perform LDSC. First I load Python 2.7, as this version works with LDSC:
# Within working dir activate python 2.7 (required for LDSC)
cd $WORKING
CONDA_SUBDIR=osx-64 conda create -n py27 python=2.7
conda activate py27
conda config --env --set subdir osx-64
Then I can clone the LDSC software from the LDSC repository (github.com/bulik/ldsc):
# Clone LSDC software into working dir
cd $WORKING
git clone https://github.com/bulik/ldsc.git
Once LDSC is cloned, I can activate it and check it is working:
# Define software space
SOFTWARE==/UsingLDSC/BMI_T2D/ldsc
# Activate LDSC
cd ldsc
conda env create --file environment.yml
conda activate ldsc
# Check LDSC is working
./munge_sumstats.py -h
# Munge T2D GWAS
$SOFTWARE/munge_sumstats.py \
--sumstats $WORKING/GWASFiles/T2DMHC.tsv \
--N 468298 \
--snp SNP \
--merge-alleles $WORKING/GWASFiles/w_hm3.snplist \
--out $WORKING/T2D
# Munge BMI GWAS
$SOFTWARE/munge_sumstats.py \
--sumstats $WORKING/GWASFiles/BMIMHC.tsv \
--N 342566 \
--snp SNP \
--merge-alleles $WORKING/GWASFiles/w_hm3.snplist \
--out $WORKING/BMI
To check the files were munged without any issues, I can check the log:
cat $WORKING/BMI.log
## *********************************************************************
## * LD Score Regression (LDSC)
## * Version 1.0.1
## * (C) 2014-2019 Brendan Bulik-Sullivan and Hilary Finucane
## * Broad Institute of MIT and Harvard / MIT Department of Mathematics
## * GNU General Public License v3
## *********************************************************************
## Call:
## ./munge_sumstats.py \
## --out /Users/alicesmail/Desktop/KCL/LCV/BMI \
## --merge-alleles /Users/alicesmail/Desktop/KCL/LCV/w_hm3.snplist \
## --N 342566.0 \
## --snp rs_id \
## --sumstats /Users/alicesmail/Desktop/KCL/LCV/GCST90428119.tsv.gz
##
## Interpreting column names as follows:
## p_value: p-Value
## rs_id: Variant ID (e.g., rs number)
## other_allele: Allele 2, interpreted as non-ref allele for signed sumstat.
## n: Sample size
## beta: [linear/logistic] regression coefficient (0 --> no effect; above 0 --> A1 is trait/risk increasing)
## effect_allele: Allele 1, interpreted as ref allele for signed sumstat.
##
## Reading list of SNPs for allele merge from w_hm3.snplist
## Read 1217311 SNPs for allele merge.
## Reading sumstats from GCST90428119.tsv.gz into memory 5000000 SNPs at a time.
## Read 18895667 SNPs from --sumstats file.
## Removed 16799176 SNPs not in --merge-alleles.
## Removed 885937 SNPs with missing values.
## Removed 0 SNPs with INFO <= 0.9.
## Removed 0 SNPs with MAF <= 0.01.
## Removed 0 SNPs with out-of-bounds p-values.
## Removed 360901 variants that were not SNPs or were strand-ambiguous.
## 849653 SNPs remain.
## Removed 1904 SNPs with duplicated rs numbers (847749 SNPs remain).
## Removed 0 SNPs with N < 228377.333333 (847749 SNPs remain).
## Median value of beta was 1.96177e-05, which seems sensible.
## Removed 1649 SNPs whose alleles did not match --merge-alleles (846100 SNPs remain).
## Writing summary statistics for 1217311 SNPs (846100 with nonmissing beta) to BMI.sumstats.gz.
##
## Metadata:
## Mean chi^2 = 2.6
## Lambda GC = 2.05
## Max chi^2 = 971.656
## 5312 Genome-wide significant SNPs (some may have been removed by filtering).
##
## Conversion finished at Mon Jun 10 16:54:59 2024
## Total time elapsed: 49.88s
Lots of information is contained in the log output for BMI. For example:
If the data has been correctly munged, I can also use LDSC to perform genetic correlation analysis:
# Get genetic correlation between BMI & T2D
$SOFTWARE/ldsc.py \
--rg $WORKING/GWASFiles/BMI.sumstats.gz,$WORKING/GWASFiles/T2D.sumstats.gz \
--ref-ld-chr $WORKING/eur_w_ld_chr/ \
--w-ld-chr $WORKING/eur_w_ld_chr/ \
--out $WORKING/BMI_T2D
Here I can use some new flags:
The log output for this function gives contains our results:
cat $WORKING/BMI_T2D.log
## *********************************************************************
## * LD Score Regression (LDSC)
## * Version 1.0.1
## * (C) 2014-2019 Brendan Bulik-Sullivan and Hilary Finucane
## * Broad Institute of MIT and Harvard / MIT Department of Mathematics
## * GNU General Public License v3
## *********************************************************************
## Call:
## ./ldsc.py \
## --ref-ld-chr eur_w_ld_chr/ \
## --out BMI_T2D \
## --rg BMI.sumstats.gz,T2D.sumstats.gz \
## --w-ld-chr eur_w_ld_chr/
##
## Beginning analysis at Mon Jun 10 16:59:35 2024
## Reading summary statistics from BMI.sumstats.gz ...
## Read summary statistics for 846100 SNPs.
## Reading reference panel LD Score from eur_w_ld_chr/[1-22] ... (ldscore_fromlist)
## Read reference panel LD Scores for 1290028 SNPs.
## Removing partitioned LD Scores with zero variance.
## Reading regression weight LD Score from eur_w_ld_chr/[1-22] ... (ldscore_fromlist)
## Read regression weight LD Scores for 1290028 SNPs.
## After merging with reference panel LD, 822013 SNPs remain.
## After merging with regression SNP LD, 822013 SNPs remain.
## Computing rg for phenotype 2/2
## Reading summary statistics from T2D.sumstats.gz ...
## Read summary statistics for 1217311 SNPs.
## After merging with summary statistics, 822013 SNPs remain.
## 816475 SNPs with valid alleles.
##
## Heritability of phenotype 1
## ---------------------------
## Total Observed scale h2: 0.2451 (0.0088)
## Lambda GC: 2.0855
## Mean Chi^2: 2.63
## Intercept: 1.0519 (0.017)
## Ratio: 0.0319 (0.0104)
##
## Heritability of phenotype 2/2
## -----------------------------
## Total Observed scale h2: 0.0442 (0.0031)
## Lambda GC: 1.3101
## Mean Chi^2: 1.4121
## Intercept: 1.0391 (0.0132)
## Ratio: 0.095 (0.0321)
##
## Genetic Covariance
## ------------------
## Total Observed scale gencov: 0.0583 (0.0034)
## Mean z1*z2: 0.5733
## Intercept: 0.1488 (0.0092)
##
## Genetic Correlation
## -------------------
## Genetic Correlation: 0.5601 (0.035)
## Z-score: 15.9808
## P: 1.7394e-57
##
## Summary of Genetic Correlation Results
## p1 p2 rg se z p h2_obs h2_obs_se h2_int h2_int_se gcov_int gcov_int_se
## BMI.sumstats.gz T2D.sumstats.gz 0.5601 0.035 15.9808 1.7394e-57 0.0442 0.0031 1.0391 0.0132 0.1488 0.0092
## Analysis finished at Mon Jun 10 16:59:44 2024
## Total time elapsed: 8.88s
From the log output I can see:
I can repeat this workflow for two traits that might not be correlated, in order to see how the results differ:
#!/bin/bash
# Define working spaces
WORKING=/UsingLDSC/Tea_Height/
# Download GWAS summary statistics: Tea and Height
wget http://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/GCST90096001-GCST90097000/GCST90096926/GCST90096926_buildGRCh37.tsv.gz -P $WORKING/GWASFiles
wget http://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/GCST90267001-GCST90268000/GCST90267284/GCST90267284_buildGRCh37.tsv -P $WORKING/GWASFiles
# Remove MHC region
$WORKING/RemoveMHC.py \
--sumstatsFile $WORKING/GWASFiles/GCST90096926_buildGRCh37.tsv \
--sep t \
--logFile $WORKING/Logs/TeaMHClog \
--outFile $WORKING/GWASFiles/TeaMHC
$WORKING/RemoveMHC.py \
--sumstatsFile $WORKING/GWASFiles/GCST90267284_buildGRCh37.tsv \
--sep t \
--logFile $WORKING/Logs/HeightMHClog \
--outFile $WORKING/GWASFiles/HeightMHC
# Within working dir activate python 2.7 (required for LDSC)
cd $WORKING
CONDA_SUBDIR=osx-64 conda create -n py27 python=2.7
conda activate py27
conda config --env --set subdir osx-64
# Define software space
SOFTWARE=/UsingLDSC/Tea_Height/ldsc
# Activate LDSC
cd ldsc
conda env create --file environment.yml
conda activate ldsc
# Check LDSC is working
./munge_sumstats.py -h
# Munge T2D GWAS
$SOFTWARE/munge_sumstats.py \
--out $WORKING/TeaMHC \
--merge-alleles $WORKING/GWASFiles/w_hm3.snplist \
--N 434171 \
--snp SNP \
--sumstats $WORKING/GWASFiles/TeaMHC.tsv
# Munge BMI GWAS
$SOFTWARE/munge_sumstats.py \
--out $WORKING/HeightMHC \
--merge-alleles $WORKING/GWASFiles/w_hm3.snplist \
--N 283749 \
--snp SNP \
--sumstats $WORKING/GWASFiles/HeightMHC.tsv
# Get genetic correlation between BMI & T2D
$SOFTWARE/ldsc.py \
--ref-ld-chr $WORKING/eur_w_ld_chr/ \
--out $WORKING/Tea_Height \
--rg $WORKING/GWASFiles/TeaMHC.sumstats.gz,$WORKING/GWASFiles/HeightMHC.sumstats.gz \
--w-ld-chr $WORKING/eur_w_ld_chr/
I can inspect the log output:
cat $WORKING/Tea_Height.log
## *********************************************************************
## * LD Score Regression (LDSC)
## * Version 1.0.1
## * (C) 2014-2019 Brendan Bulik-Sullivan and Hilary Finucane
## * Broad Institute of MIT and Harvard / MIT Department of Mathematics
## * GNU General Public License v3
## *********************************************************************
## Call:
## ./ldsc.py \
## --ref-ld-chr /Users/alicesmail/Desktop/KCL/LCV/eur_w_ld_chr/ \
## --out /Users/alicesmail/Desktop/KCL/LCV/Tea_Park \
## --rg /Users/alicesmail/Desktop/KCL/LCV/GWASFiles/TeaMHC.sumstats.gz,/Users/alicesmail/Desktop/KCL/LCV/GWASFiles/HeightMHC.sumstats.gz \
## --w-ld-chr /Users/alicesmail/Desktop/KCL/LCV/eur_w_ld_chr/
##
## Beginning analysis at Tue Jun 25 11:06:09 2024
## Reading summary statistics from TeaMHC.sumstats.gz ...
## Read summary statistics for 1153397 SNPs.
## Reading reference panel LD Score from eur_w_ld_chr/[1-22] ... (ldscore_fromlist)
## Read reference panel LD Scores for 1290028 SNPs.
## Removing partitioned LD Scores with zero variance.
## Reading regression weight LD Score from eur_w_ld_chr/[1-22] ... (ldscore_fromlist)
## Read regression weight LD Scores for 1290028 SNPs.
## After merging with reference panel LD, 1141281 SNPs remain.
## After merging with regression SNP LD, 1141281 SNPs remain.
## Computing rg for phenotype 2/2
## Reading summary statistics from HeightMHC.sumstats.gz ...
## Read summary statistics for 1217311 SNPs.
## After merging with summary statistics, 1141281 SNPs remain.
## 1021233 SNPs with valid alleles.
##
## Heritability of phenotype 1
## ---------------------------
## Total Observed scale h2: 0.0491 (0.0026)
## Lambda GC: 1.3685
## Mean Chi^2: 1.4605
## Intercept: 1.0239 (0.0108)
## Ratio: 0.052 (0.0234)
##
## Heritability of phenotype 2/2
## -----------------------------
## Total Observed scale h2: 0.4476 (0.0247)
## Lambda GC: 2.2826
## Mean Chi^2: 3.8319
## Intercept: 1.2305 (0.0524)
## Ratio: 0.0814 (0.0185)
##
## Genetic Covariance
## ------------------
## Total Observed scale gencov: 0.0044 (0.0028)
## Mean z1*z2: 0.0423
## Intercept: -0.0116 (0.0089)
##
## Genetic Correlation
## -------------------
## Genetic Correlation: 0.0298 (0.0191)
## Z-score: 1.5604
## P: 0.1187
##
## Summary of Genetic Correlation Results
## p1 p2 rg se z p h2_obs h2_obs_se h2_int h2_int_se gcov_int gcov_int_se
## TeaMHC.sumstats.gz HeightMHC.sumstats.gz 0.0298 0.0191 1.5604 0.1187 0.4476 0.0247 1.2305 0.0524 -0.0116 0.0089
##
## Analysis finished at Tue Jun 25 11:06:18 2024
## Total time elapsed: 9.27s
Here we can see that:
Here I have performed genetic correlation analysis on two pairs of traits using the LDSC package. While one pair of traits (T2D & BMI) are highly genetically correlated, tea-drinking and height show no genetic correlation.