MR-Hevo

Introduction

This vignette demonstrates an analysis using MRHevo to overcome some of the limitations of current methods for instrumental variable analysis with genetic instruments (“Mendelian randomization”).

Multiple variants are used to construct each unlinked instrument, rather than selecting a single SNP from each genomic region that is associated with the exposure.
There is no need to exclude weak instruments because they are correctly handled by Bayesian inference.
Inference of causality in the presence of pleiotropic effects is based on marginalizing over the distribution of pleiotropic effects to compute the likelihood as a function of the causal effect parameter, rather than on constructing “estimators” that downweight or exclude those instruments that appear to be outliers. The statistical model is described on [this page](https://github.com/molepi-precmed/mrhevo/blob/main/theorymethods.pdf.

Where multiple SNPs are used to calculate each scalar instrument from summary statistics for the genotype-exposure association, an individual-level dataset with genotypes and outcome is required in step 2 to calculate summary statistics for the effects of these instrument on the outcome. The model-fitting can use either summary statistics (fast) or individual-level data (slow if the dataset is large).

Input dataset

The input dataset should comprise:

Y a vector of length N containing outcomes (in this example type 2 diabetes as a binary outcome
Z an N x J data.table of unlinked genetic instruments.
X_u an N x U data.table of unpenalized covariates (this will usually include scores on the first few principal components of the genotype matrix
Z.metadata a data.table with J rows, one for each genetic instrument. Estimated coefficients for the effects of instruments Z on the exposure, and the corresponding standard errors of these estimates are in columns named alpha_hat and se_alpha_hat respectively. The column scoreid should contain the names of the columns of Z. This table can include other metadata such as the genomic position of each genotypic instrument, or the names of nearby genes.

In this example the genetic instruments consist of 12 trans-pQTLs for the levels of adiponectin, encoded by the gene ADIPOQ in whole blood. There are also two cis-pQTLs that are not used in the instrumental variable analysis, because cis-acting variants in or near this gene are known to affect the splicing of the transcript. Summary statistics for these pQTLs were derived from the DeCODE proteomics study which used the Somalogic platform

The trans-pQTLs are far enough apart to be modelled as independent. The target dataset is the UK Biobank cohort, and the outcome is type 2 diabetes. The unpenalized covariates are sex and the first five principal components of the genotype matrix.


library(data.table)
library(ggplot2)
library(rstan)
require(knitr)

source("functions.mrhevo.R")

load("data_mrhevo.RData")

if(length(unique(Y)) == 2) {
  logistic <- TRUE
} else {
  logistic <- FALSE
}

As a first step, we compute summary statistics for the effect of each instrument on the outcome, and the ratio of the instrument-outcome coefficient beta_YZ to the instrument-exposure coefficient alpha_hat. An approximate standard error for the ratio theta_IV, defined as beta_YZ / alpha_hat, is calculated by the delta method (a first-order Taylor expansion).

Summary statistics


coeffs.scores <- get_summarystatsforMR(Y, Z, X_u)

coeffs.scores <- Z.metadata[coeffs.scores, on="scoreid"]
coeffs.scores[, thetaIV := beta_YZ / alpha_hat]  # ratio estimates
coeffs.scores[, se.thetaIV := SE_YZ / alpha_hat] 
coeffs.scores[, se.thetaIV_delta := sqrt((SE_YZ / alpha_hat)^2 +
                                     beta_YZ^2 * se.alpha_hat^2 / alpha_hat^4)] # delta method

knitr::kable(coeffs.scores[, .(qtlname, qtl_type, alpha_hat, se.alpha_hat,
                               beta_YZ, SE_YZ, thetaIV, z, p)])

qtlname	qtl_type	alpha_hat	se.alpha_hat	beta_YZ	SE_YZ	thetaIV	z	p
c1:219.4	trans	0.0509899	0.0070639	-0.0292637	0.0051390	-0.5739115	-5.6944050	0.0000000
c1:77.5	trans	0.0344627	0.0064386	0.0037473	0.0051909	0.1087352	0.7218937	0.4703598
c1:9.9	trans	0.0361292	0.0071903	-0.0109785	0.0052606	-0.3038681	-2.0869398	0.0368936
ALAS1	trans	0.0473569	0.0080025	-0.0173131	0.0051742	-0.3655881	-3.3460674	0.0008197
c6:163.3	trans	0.0254150	0.0051803	-0.0023461	0.0051326	-0.0923101	-0.4570915	0.6476053
TRIB1	trans	0.0438447	0.0073081	-0.0120452	0.0050825	-0.2747241	-2.3699548	0.0177903
MIR511	trans	0.0393044	0.0065677	-0.0046359	0.0050883	-0.1179480	-0.9110785	0.3622540
ADRB1	trans	0.0326294	0.0062769	-0.0177924	0.0050890	-0.5452866	-3.4962561	0.0004718
CMIP	trans	0.0539801	0.0005398	-0.0295299	0.0050676	-0.5470516	-5.8271689	0.0000000
c17:67.4	trans	0.0333792	0.0066200	-0.0084370	0.0051326	-0.2527629	-1.6438039	0.1002167
APOC1	trans	0.0327474	0.0061291	-0.0461022	0.0052367	-1.4078132	-8.8036521	0.0000000
LKAAEAR1	trans	0.0340694	0.0063322	-0.0235592	0.0050769	-0.6915072	-4.6404990	0.0000035
ABCC5	cis	0.0566667	0.0114463	0.0059991	0.0051334	0.1058665	1.1686508	0.2425444
ADIPOQ	cis	0.3288945	0.0032889	0.0050036	0.0051477	0.0152133	0.9719955	0.3310528

From the summary statistics (excluding the cis-eQTL) we can obtain three widely-used estimators of the causal effect parameter: the inverse-variance weighted mean, the weighted median, and the penalized weighted median. As expected, the weighted median and the penalized weighted median downweight the outlier and give less extreme estimates of the causal effect.

Estimators of the causal effect based on summary statistics


estimators.dt <- get_estimatorsMR(coeffs.scores[qtl_type=="trans"])[, 1:5] 

knitr::kable(estimators.dt)

Estimator	Estimate	SE	z	pvalue
Weighted mean	-0.3714990	0.0418619	-8.874400	0e+00
Weighted median	-0.3475400	0.0679252	-5.116511	3e-07
Penalized weighted median	-0.3584349	0.0698595	-5.130796	3e-07

A plot of beta_YZ against alpha_hat shows that 11 of the 12 trans-pQTL instruments are inversely associated with type 2 diabetes. The _cis- pQTLs have large effects on circulating levels of the protein, but no effect on type 2 diabetes; this is this is explicable, as the cis- acting SNPs alter the splicing of ADIPOQ. Of the trans-pQTLs, the APOC1 locus is an outlier, with a more extreme ratio of beta_YZ to alpha_hat than the others.


## size of points inversely proportional to standard error 
coeffs.scores[, size.thetaIV := 0.5 * sum(se.thetaIV_delta) / se.thetaIV_delta] 

gene.query <- "ADIPOQ"
long.phenoname <- "type 2 diabetes"
p.coeffs <- ggplot(coeffs.scores,
                                    aes(x=alpha_hat, y=beta_YZ, color=qtl_type)) + 
      geom_point(aes(size=size.thetaIV), alpha=0.8) +
      scale_size(guide="none") + 
    ggrepel::geom_text_repel(aes(label=qtlname), force=2, size=4, fontface="italic", color="blue") + 
    scale_x_continuous(limits = c(0, NA), expand = expansion(mult = c(0, 0.1))) + 
    scale_y_continuous(expand =  expansion(mult = c(0.1, 0.1))) + 
    geom_abline(data=estimators.dt,
                aes(slope=Estimate, intercept=rep(0, 3),
                    linetype=Estimator)) +
    theme(legend.position="top") +
    theme(legend.direction="vertical") +
    theme(legend.box="horizontal") +
    guides(color=guide_legend(title="QTL type")) +
    xlab(paste("Effect of genetic instrument on ", gene.query)) +
    ylab(paste("Effect on", long.phenoname)) +
  coord_fixed(ratio=2)
save(p.coeffs, file="p.coeffs.ADIPOQ.RData")
p.coeffs

Sample the posterior density

The function run_mrhevo.sstats will call Stan to sample the posterior distribution of all parameters, given the model and the data consisting of summary statistics. It returns an object of class stanfit

As recommended by Piironen and Vehtari (2017), we supply a prior guess of the fraction of instruments that have nonzero pleiotropic effects. With a large enough sample size, this prior guess will not make much difference to the results. For this example, we guess the fraction to be 0.2. This guess is used to set the scale of the prior on the global shrinkage parameter.

The causal effect parameter theta is assigned a Gaussian prior with mean zero and standard deviation priorsd.theta. As we will divide the posterior by the prior to obtain the likelihood as a function of theta, the setting of this prior should not make much difference but the sampler will be more efficient if the prior standard deviation is set to a value that is supported by the data.

To speed up computation in this large dataset, the statistical model is fitted to summary statistics rather than to individual-level data.


options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)

use.sampling <- TRUE
logistic <- TRUE
priorsd_theta <- 1
newrun <- FALSE

which.trans <- colnames(Z) %in% Z.metadata[qtl_type=="trans", scoreid]
J <- length(which.trans)
N <- nrow(Z)

gamma_hat <- coeffs.scores[qtl_type=="trans", beta_YZ]
se.gamma_hat <- coeffs.scores[qtl_type=="trans", SE_YZ]
alpha_hat <- coeffs.scores[qtl_type=="trans", alpha_hat]
se.alpha_hat <- coeffs.scores[qtl_type=="trans", se.alpha_hat]

hevo.stanfit <- run_mrhevo.sstats(alpha_hat, se.alpha_hat,
                                  gamma_hat, se.gamma_hat,
                                  fraction_pleio=0.2,
                                  priorsd_theta=priorsd_theta)
#> compiling stan model ... done
#> Sampling posterior distribution ... 
#> CHECKING DATA AND PREPROCESSING FOR MODEL 'MRHevo.summarystats' NOW.
#> 
#> COMPILING MODEL 'MRHevo.summarystats' NOW.
#> 
#> STARTING SAMPLER FOR MODEL 'MRHevo.summarystats' NOW.
#> Warning: There were 7 divergent transitions after warmup. See
#> https://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
#> to find out why this is a problem and how to eliminate them.
#> Warning: Examine the pairs() plot to diagnose sampling problems
#> done

Maximum likelihood estimate and p-value

From the posterior samples, we compute the likelihood as a function of theta. We fit a quadratic approximation to the log-likelihood, and use this to obtain a classical maximum likelihood and p-value.

A plot of the posterior density of theta shows that it is approximately Gaussian, and a plot of the log-likelihood shows that it can be approximated by a quadratic function. This guarantees that the value of theta that maximizes the likelihood will have the sampling properties – consistency, unbiasedness, and minimum variance – that are desirable for an “estimator”.


theta.samples <-  unlist(extract(hevo.stanfit, pars="theta"))
prior.theta <- dnorm(theta.samples, mean=0, sd=priorsd_theta)
mle.theta <- mle.se.pval(x=theta.samples, prior=prior.theta)
mle.theta[, Estimator := "Marginalize over direct effects"]

p.bayesianloglik <-  mle.se.pval(x=theta.samples, prior=prior.theta, return.asplot=TRUE)
p.bayesianloglik

Comparison with other estimates

Marginalizing over the distribution of unobserved pleiotropic effects is more conservative than the standard methods for inferring the causal effect parameter, at least in this example.


methods.dt <- rbind(estimators.dt, mle.theta, fill=TRUE)
knitr::kable(methods.dt[, .(Estimator, Estimate, SE, z, pvalue)])

Estimator	Estimate	SE	z	pvalue
Weighted mean	-0.3714990	0.0418619	-8.874400	0.00e+00
Weighted median	-0.3475400	0.0679252	-5.116511	3.00e-07
Penalized weighted median	-0.3584349	0.0698595	-5.130796	3.00e-07
Marginalize over direct effects	-0.3754964	0.0911071	-4.121484	3.76e-05

Shrinkage coefficients

For each instrument, we can compute from the model parameters a shrinkage coefficent kappa that takes values between 0 (no shrinkage of the pleiotropic effect) and 1 (complete shrinkage of the effect to zero). The posterior distribution of the J shrinkage coefficients has the horseshoe shape specified by the prior.


kappa.all <- extract(hevo.stanfit, pars="kappa")$kappa
kappa.dt <- data.table(kappa=as.numeric(kappa.all))

p.kappa <- ggplot(kappa.dt, aes(x=kappa)) +
  geom_histogram(aes(y=after_stat(density)),
                 breaks=seq(0, 1, by=0.05), color="black", fill="gray") +
  labs(x = expression(paste("Shrinkage coefficient ", kappa, " imposed on direct effects of instruments")),
       y= "Posterior density")
p.kappa

Coefficients for pleiotropic effects

Summaries of the posterior distribution show that the pleiotropic effect beta[11] which represents the APOC1 locus has escaped shrinkage, but the pleiotropic effects of the other four instruments have been shrunk to near zero.


if(use.sampling) {
  pars.forsummary <- c("theta", "beta", "log_c", "log_tau", "m_eff", "lp__")
} else {
  pars.forsummary <- c("theta", "beta", "log_c", "log_tau", "m_eff")
}

fit.coeffs <- summary(hevo.stanfit,
                      pars=pars.forsummary, 
                      probs=c(0.1, 0.9))$summary 
fit.coeffs <- as.data.table(round(fit.coeffs, 2), keep.rownames="variable")

knitr::kable(fit.coeffs)

variable	mean	se_mean	sd	10%	90%	n_eff	Rhat
theta	-0.39	0.00	0.08	-0.49	-0.29	2261.83	1
beta[1]	-0.01	0.00	0.01	-0.01	0.00	2727.93	1
beta[2]	0.01	0.00	0.01	0.00	0.02	3279.81	1
beta[3]	0.00	0.00	0.00	0.00	0.01	5615.80	1
beta[4]	0.00	0.00	0.00	0.00	0.01	4809.12	1
beta[5]	0.00	0.00	0.00	0.00	0.01	4705.02	1
beta[6]	0.00	0.00	0.00	0.00	0.01	4505.16	1
beta[7]	0.01	0.00	0.01	0.00	0.01	3723.95	1
beta[8]	0.00	0.00	0.00	-0.01	0.00	4778.98	1
beta[9]	0.00	0.00	0.01	-0.01	0.00	2702.92	1
beta[10]	0.00	0.00	0.00	0.00	0.01	5423.68	1
beta[11]	-0.03	0.00	0.01	-0.04	-0.02	5269.51	1
beta[12]	-0.01	0.00	0.01	-0.01	0.00	3086.93	1
log_c	-1.13	0.01	0.63	-1.83	-0.30	3843.97	1
log_tau	-4.97	0.01	0.67	-5.83	-4.17	2714.18	1
m_eff	5.56	0.02	1.24	3.95	7.14	4257.44	1
lp__	-48.91	0.14	6.26	-57.10	-41.06	2043.08	1