MR-Hevo

Paul McKeigue

Introduction

This vignette demonstrates an analysis using MRHevo to overcome some of the limitations of current methods for instrumental variable analysis with genetic instruments (“Mendelian randomization”).

Where multiple SNPs are used to calculate each scalar instrument from summary statistics for the genotype-exposure association, an individual-level dataset with genotypes and outcome is required in step 2 to calculate summary statistics for the effects of these instrument on the outcome. The model-fitting can use either summary statistics (fast) or individual-level data (slow if the dataset is large).

Input dataset

The input dataset should comprise:

In this example the genetic instruments consist of 12 trans-pQTLs for the levels of adiponectin, encoded by the gene ADIPOQ in whole blood. There are also two cis-pQTLs that are not used in the instrumental variable analysis, because cis-acting variants in or near this gene are known to affect the splicing of the transcript. Summary statistics for these pQTLs were derived from the DeCODE proteomics study which used the Somalogic platform

The trans-pQTLs are far enough apart to be modelled as independent. The target dataset is the UK Biobank cohort, and the outcome is type 2 diabetes. The unpenalized covariates are sex and the first five principal components of the genotype matrix.


library(data.table)
library(ggplot2)
library(rstan)
require(knitr)

source("functions.mrhevo.R")

load("data_mrhevo.RData")

if(length(unique(Y)) == 2) {
  logistic <- TRUE
} else {
  logistic <- FALSE
}

As a first step, we compute summary statistics for the effect of each instrument on the outcome, and the ratio of the instrument-outcome coefficient beta_YZ to the instrument-exposure coefficient alpha_hat. An approximate standard error for the ratio theta_IV, defined as beta_YZ / alpha_hat, is calculated by the delta method (a first-order Taylor expansion).

Summary statistics


coeffs.scores <- get_summarystatsforMR(Y, Z, X_u)

coeffs.scores <- Z.metadata[coeffs.scores, on="scoreid"]
coeffs.scores[, thetaIV := beta_YZ / alpha_hat]  # ratio estimates
coeffs.scores[, se.thetaIV := SE_YZ / alpha_hat] 
coeffs.scores[, se.thetaIV_delta := sqrt((SE_YZ / alpha_hat)^2 +
                                     beta_YZ^2 * se.alpha_hat^2 / alpha_hat^4)] # delta method

knitr::kable(coeffs.scores[, .(qtlname, qtl_type, alpha_hat, se.alpha_hat,
                               beta_YZ, SE_YZ, thetaIV, z, p)])
qtlname qtl_type alpha_hat se.alpha_hat beta_YZ SE_YZ thetaIV z p
c1:219.4 trans 0.0509899 0.0070639 -0.0292637 0.0051390 -0.5739115 -5.6944050 0.0000000
c1:77.5 trans 0.0344627 0.0064386 0.0037473 0.0051909 0.1087352 0.7218937 0.4703598
c1:9.9 trans 0.0361292 0.0071903 -0.0109785 0.0052606 -0.3038681 -2.0869398 0.0368936
ALAS1 trans 0.0473569 0.0080025 -0.0173131 0.0051742 -0.3655881 -3.3460674 0.0008197
c6:163.3 trans 0.0254150 0.0051803 -0.0023461 0.0051326 -0.0923101 -0.4570915 0.6476053
TRIB1 trans 0.0438447 0.0073081 -0.0120452 0.0050825 -0.2747241 -2.3699548 0.0177903
MIR511 trans 0.0393044 0.0065677 -0.0046359 0.0050883 -0.1179480 -0.9110785 0.3622540
ADRB1 trans 0.0326294 0.0062769 -0.0177924 0.0050890 -0.5452866 -3.4962561 0.0004718
CMIP trans 0.0539801 0.0005398 -0.0295299 0.0050676 -0.5470516 -5.8271689 0.0000000
c17:67.4 trans 0.0333792 0.0066200 -0.0084370 0.0051326 -0.2527629 -1.6438039 0.1002167
APOC1 trans 0.0327474 0.0061291 -0.0461022 0.0052367 -1.4078132 -8.8036521 0.0000000
LKAAEAR1 trans 0.0340694 0.0063322 -0.0235592 0.0050769 -0.6915072 -4.6404990 0.0000035
ABCC5 cis 0.0566667 0.0114463 0.0059991 0.0051334 0.1058665 1.1686508 0.2425444
ADIPOQ cis 0.3288945 0.0032889 0.0050036 0.0051477 0.0152133 0.9719955 0.3310528

From the summary statistics (excluding the cis-eQTL) we can obtain three widely-used estimators of the causal effect parameter: the inverse-variance weighted mean, the weighted median, and the penalized weighted median. As expected, the weighted median and the penalized weighted median downweight the outlier and give less extreme estimates of the causal effect.

Estimators of the causal effect based on summary statistics


estimators.dt <- get_estimatorsMR(coeffs.scores[qtl_type=="trans"])[, 1:5] 

knitr::kable(estimators.dt)
Estimator Estimate SE z pvalue
Weighted mean -0.3714990 0.0418619 -8.874400 0e+00
Weighted median -0.3475400 0.0679252 -5.116511 3e-07
Penalized weighted median -0.3584349 0.0698595 -5.130796 3e-07

A plot of beta_YZ against alpha_hat shows that 11 of the 12 trans-pQTL instruments are inversely associated with type 2 diabetes. The _cis- pQTLs have large effects on circulating levels of the protein, but no effect on type 2 diabetes; this is this is explicable, as the cis- acting SNPs alter the splicing of ADIPOQ. Of the trans-pQTLs, the APOC1 locus is an outlier, with a more extreme ratio of beta_YZ to alpha_hat than the others.


## size of points inversely proportional to standard error 
coeffs.scores[, size.thetaIV := 0.5 * sum(se.thetaIV_delta) / se.thetaIV_delta] 

gene.query <- "ADIPOQ"
long.phenoname <- "type 2 diabetes"
p.coeffs <- ggplot(coeffs.scores,
                                    aes(x=alpha_hat, y=beta_YZ, color=qtl_type)) + 
      geom_point(aes(size=size.thetaIV), alpha=0.8) +
      scale_size(guide="none") + 
    ggrepel::geom_text_repel(aes(label=qtlname), force=2, size=4, fontface="italic", color="blue") + 
    scale_x_continuous(limits = c(0, NA), expand = expansion(mult = c(0, 0.1))) + 
    scale_y_continuous(expand =  expansion(mult = c(0.1, 0.1))) + 
    geom_abline(data=estimators.dt,
                aes(slope=Estimate, intercept=rep(0, 3),
                    linetype=Estimator)) +
    theme(legend.position="top") +
    theme(legend.direction="vertical") +
    theme(legend.box="horizontal") +
    guides(color=guide_legend(title="QTL type")) +
    xlab(paste("Effect of genetic instrument on ", gene.query)) +
    ylab(paste("Effect on", long.phenoname)) +
  coord_fixed(ratio=2)
save(p.coeffs, file="p.coeffs.ADIPOQ.RData")
p.coeffs

Sample the posterior density

The function run_mrhevo.sstats will call Stan to sample the posterior distribution of all parameters, given the model and the data consisting of summary statistics. It returns an object of class stanfit

As recommended by Piironen and Vehtari (2017), we supply a prior guess of the fraction of instruments that have nonzero pleiotropic effects. With a large enough sample size, this prior guess will not make much difference to the results. For this example, we guess the fraction to be 0.2. This guess is used to set the scale of the prior on the global shrinkage parameter.

The causal effect parameter theta is assigned a Gaussian prior with mean zero and standard deviation priorsd.theta. As we will divide the posterior by the prior to obtain the likelihood as a function of theta, the setting of this prior should not make much difference but the sampler will be more efficient if the prior standard deviation is set to a value that is supported by the data.

To speed up computation in this large dataset, the statistical model is fitted to summary statistics rather than to individual-level data.


options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)

use.sampling <- TRUE
logistic <- TRUE
priorsd_theta <- 1
newrun <- FALSE

which.trans <- colnames(Z) %in% Z.metadata[qtl_type=="trans", scoreid]
J <- length(which.trans)
N <- nrow(Z)

gamma_hat <- coeffs.scores[qtl_type=="trans", beta_YZ]
se.gamma_hat <- coeffs.scores[qtl_type=="trans", SE_YZ]
alpha_hat <- coeffs.scores[qtl_type=="trans", alpha_hat]
se.alpha_hat <- coeffs.scores[qtl_type=="trans", se.alpha_hat]

hevo.stanfit <- run_mrhevo.sstats(alpha_hat, se.alpha_hat,
                                  gamma_hat, se.gamma_hat,
                                  fraction_pleio=0.2,
                                  priorsd_theta=priorsd_theta)
#> compiling stan model ... done
#> Sampling posterior distribution ... 
#> CHECKING DATA AND PREPROCESSING FOR MODEL 'MRHevo.summarystats' NOW.
#> 
#> COMPILING MODEL 'MRHevo.summarystats' NOW.
#> 
#> STARTING SAMPLER FOR MODEL 'MRHevo.summarystats' NOW.
#> Warning: There were 7 divergent transitions after warmup. See
#> https://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
#> to find out why this is a problem and how to eliminate them.
#> Warning: Examine the pairs() plot to diagnose sampling problems
#> done

Maximum likelihood estimate and p-value

From the posterior samples, we compute the likelihood as a function of theta. We fit a quadratic approximation to the log-likelihood, and use this to obtain a classical maximum likelihood and p-value.

A plot of the posterior density of theta shows that it is approximately Gaussian, and a plot of the log-likelihood shows that it can be approximated by a quadratic function. This guarantees that the value of theta that maximizes the likelihood will have the sampling properties – consistency, unbiasedness, and minimum variance – that are desirable for an “estimator”.


theta.samples <-  unlist(extract(hevo.stanfit, pars="theta"))
prior.theta <- dnorm(theta.samples, mean=0, sd=priorsd_theta)
mle.theta <- mle.se.pval(x=theta.samples, prior=prior.theta)
mle.theta[, Estimator := "Marginalize over direct effects"]

p.bayesianloglik <-  mle.se.pval(x=theta.samples, prior=prior.theta, return.asplot=TRUE)
p.bayesianloglik

Comparison with other estimates

Marginalizing over the distribution of unobserved pleiotropic effects is more conservative than the standard methods for inferring the causal effect parameter, at least in this example.


methods.dt <- rbind(estimators.dt, mle.theta, fill=TRUE)
knitr::kable(methods.dt[, .(Estimator, Estimate, SE, z, pvalue)])
Estimator Estimate SE z pvalue
Weighted mean -0.3714990 0.0418619 -8.874400 0.00e+00
Weighted median -0.3475400 0.0679252 -5.116511 3.00e-07
Penalized weighted median -0.3584349 0.0698595 -5.130796 3.00e-07
Marginalize over direct effects -0.3754964 0.0911071 -4.121484 3.76e-05

Shrinkage coefficients

For each instrument, we can compute from the model parameters a shrinkage coefficent kappa that takes values between 0 (no shrinkage of the pleiotropic effect) and 1 (complete shrinkage of the effect to zero). The posterior distribution of the J shrinkage coefficients has the horseshoe shape specified by the prior.


kappa.all <- extract(hevo.stanfit, pars="kappa")$kappa
kappa.dt <- data.table(kappa=as.numeric(kappa.all))

p.kappa <- ggplot(kappa.dt, aes(x=kappa)) +
  geom_histogram(aes(y=after_stat(density)),
                 breaks=seq(0, 1, by=0.05), color="black", fill="gray") +
  labs(x = expression(paste("Shrinkage coefficient ", kappa, " imposed on direct effects of instruments")),
       y= "Posterior density")
p.kappa

Coefficients for pleiotropic effects

Summaries of the posterior distribution show that the pleiotropic effect beta[11] which represents the APOC1 locus has escaped shrinkage, but the pleiotropic effects of the other four instruments have been shrunk to near zero.


if(use.sampling) {
  pars.forsummary <- c("theta", "beta", "log_c", "log_tau", "m_eff", "lp__")
} else {
  pars.forsummary <- c("theta", "beta", "log_c", "log_tau", "m_eff")
}

fit.coeffs <- summary(hevo.stanfit,
                      pars=pars.forsummary, 
                      probs=c(0.1, 0.9))$summary 
fit.coeffs <- as.data.table(round(fit.coeffs, 2), keep.rownames="variable")

knitr::kable(fit.coeffs)
variable mean se_mean sd 10% 90% n_eff Rhat
theta -0.39 0.00 0.08 -0.49 -0.29 2261.83 1
beta[1] -0.01 0.00 0.01 -0.01 0.00 2727.93 1
beta[2] 0.01 0.00 0.01 0.00 0.02 3279.81 1
beta[3] 0.00 0.00 0.00 0.00 0.01 5615.80 1
beta[4] 0.00 0.00 0.00 0.00 0.01 4809.12 1
beta[5] 0.00 0.00 0.00 0.00 0.01 4705.02 1
beta[6] 0.00 0.00 0.00 0.00 0.01 4505.16 1
beta[7] 0.01 0.00 0.01 0.00 0.01 3723.95 1
beta[8] 0.00 0.00 0.00 -0.01 0.00 4778.98 1
beta[9] 0.00 0.00 0.01 -0.01 0.00 2702.92 1
beta[10] 0.00 0.00 0.00 0.00 0.01 5423.68 1
beta[11] -0.03 0.00 0.01 -0.04 -0.02 5269.51 1
beta[12] -0.01 0.00 0.01 -0.01 0.00 3086.93 1
log_c -1.13 0.01 0.63 -1.83 -0.30 3843.97 1
log_tau -4.97 0.01 0.67 -5.83 -4.17 2714.18 1
m_eff 5.56 0.02 1.24 3.95 7.14 4257.44 1
lp__ -48.91 0.14 6.26 -57.10 -41.06 2043.08 1

Sampler diagnostics

A pairs plot is useful to diagnose problems with the sampler.


if(use.sampling) {
  pairs(hevo.stanfit, pars=c("theta", 
                     "log_c", "log_tau", "m_eff",
                     "lp__"))
}