This vignette demonstrates an analysis using MRHevo
to overcome some of the limitations of current methods for instrumental variable analysis with genetic instruments (“Mendelian randomization”).
Multiple variants are used to construct each unlinked instrument, rather than selecting a single SNP from each genomic region that is associated with the exposure.
There is no need to exclude weak instruments because they are correctly handled by Bayesian inference.
Inference of causality in the presence of pleiotropic effects is based on marginalizing over the distribution of pleiotropic effects to compute the likelihood as a function of the causal effect parameter, rather than on constructing “estimators” that downweight or exclude those instruments that appear to be outliers. The statistical model is described on [this page](https://github.com/molepi-precmed/mrhevo/blob/main/theorymethods.pdf.
Where multiple SNPs are used to calculate each scalar instrument from summary statistics for the genotype-exposure association, an individual-level dataset with genotypes and outcome is required in step 2 to calculate summary statistics for the effects of these instrument on the outcome. The model-fitting can use either summary statistics (fast) or individual-level data (slow if the dataset is large).
The input dataset should comprise:
Y
a vector of length N
containing outcomes (in this example type 2 diabetes as a binary outcome
Z
an N x J
data.table of unlinked genetic instruments.
X_u
an N x U
data.table of unpenalized covariates (this will usually include scores on the first few principal components of the genotype matrix
Z.metadata
a data.table with J
rows, one for each genetic instrument. Estimated coefficients for the effects of instruments Z
on the exposure, and the corresponding standard errors of these estimates are in columns named alpha_hat
and se_alpha_hat
respectively. The column scoreid
should contain the names of the columns of Z
. This table can include other metadata such as the genomic position of each genotypic instrument, or the names of nearby genes.
In this example the genetic instruments consist of 12 trans-pQTLs for the levels of adiponectin, encoded by the gene ADIPOQ in whole blood. There are also two cis-pQTLs that are not used in the instrumental variable analysis, because cis-acting variants in or near this gene are known to affect the splicing of the transcript. Summary statistics for these pQTLs were derived from the DeCODE proteomics study which used the Somalogic platform
The trans-pQTLs are far enough apart to be modelled as independent. The target dataset is the UK Biobank cohort, and the outcome is type 2 diabetes. The unpenalized covariates are sex and the first five principal components of the genotype matrix.
library(data.table)
library(ggplot2)
library(rstan)
require(knitr)
source("functions.mrhevo.R")
load("data_mrhevo.RData")
if(length(unique(Y)) == 2) {
TRUE
logistic <-else {
} FALSE
logistic <- }
As a first step, we compute summary statistics for the effect of each instrument on the outcome, and the ratio of the instrument-outcome coefficient beta_YZ
to the instrument-exposure coefficient alpha_hat
. An approximate standard error for the ratio theta_IV
, defined as beta_YZ / alpha_hat
, is calculated by the delta method (a first-order Taylor expansion).
get_summarystatsforMR(Y, Z, X_u)
coeffs.scores <-
Z.metadata[coeffs.scores, on="scoreid"]
coeffs.scores <-:= beta_YZ / alpha_hat] # ratio estimates
coeffs.scores[, thetaIV := SE_YZ / alpha_hat]
coeffs.scores[, se.thetaIV := sqrt((SE_YZ / alpha_hat)^2 +
coeffs.scores[, se.thetaIV_delta beta_YZ^2 * se.alpha_hat^2 / alpha_hat^4)] # delta method
::kable(coeffs.scores[, .(qtlname, qtl_type, alpha_hat, se.alpha_hat,
knitr beta_YZ, SE_YZ, thetaIV, z, p)])
qtlname | qtl_type | alpha_hat | se.alpha_hat | beta_YZ | SE_YZ | thetaIV | z | p |
---|---|---|---|---|---|---|---|---|
c1:219.4 | trans | 0.0509899 | 0.0070639 | -0.0292637 | 0.0051390 | -0.5739115 | -5.6944050 | 0.0000000 |
c1:77.5 | trans | 0.0344627 | 0.0064386 | 0.0037473 | 0.0051909 | 0.1087352 | 0.7218937 | 0.4703598 |
c1:9.9 | trans | 0.0361292 | 0.0071903 | -0.0109785 | 0.0052606 | -0.3038681 | -2.0869398 | 0.0368936 |
ALAS1 | trans | 0.0473569 | 0.0080025 | -0.0173131 | 0.0051742 | -0.3655881 | -3.3460674 | 0.0008197 |
c6:163.3 | trans | 0.0254150 | 0.0051803 | -0.0023461 | 0.0051326 | -0.0923101 | -0.4570915 | 0.6476053 |
TRIB1 | trans | 0.0438447 | 0.0073081 | -0.0120452 | 0.0050825 | -0.2747241 | -2.3699548 | 0.0177903 |
MIR511 | trans | 0.0393044 | 0.0065677 | -0.0046359 | 0.0050883 | -0.1179480 | -0.9110785 | 0.3622540 |
ADRB1 | trans | 0.0326294 | 0.0062769 | -0.0177924 | 0.0050890 | -0.5452866 | -3.4962561 | 0.0004718 |
CMIP | trans | 0.0539801 | 0.0005398 | -0.0295299 | 0.0050676 | -0.5470516 | -5.8271689 | 0.0000000 |
c17:67.4 | trans | 0.0333792 | 0.0066200 | -0.0084370 | 0.0051326 | -0.2527629 | -1.6438039 | 0.1002167 |
APOC1 | trans | 0.0327474 | 0.0061291 | -0.0461022 | 0.0052367 | -1.4078132 | -8.8036521 | 0.0000000 |
LKAAEAR1 | trans | 0.0340694 | 0.0063322 | -0.0235592 | 0.0050769 | -0.6915072 | -4.6404990 | 0.0000035 |
ABCC5 | cis | 0.0566667 | 0.0114463 | 0.0059991 | 0.0051334 | 0.1058665 | 1.1686508 | 0.2425444 |
ADIPOQ | cis | 0.3288945 | 0.0032889 | 0.0050036 | 0.0051477 | 0.0152133 | 0.9719955 | 0.3310528 |
From the summary statistics (excluding the cis-eQTL) we can obtain three widely-used estimators of the causal effect parameter: the inverse-variance weighted mean, the weighted median, and the penalized weighted median. As expected, the weighted median and the penalized weighted median downweight the outlier and give less extreme estimates of the causal effect.
get_estimatorsMR(coeffs.scores[qtl_type=="trans"])[, 1:5]
estimators.dt <-
::kable(estimators.dt) knitr
Estimator | Estimate | SE | z | pvalue |
---|---|---|---|---|
Weighted mean | -0.3714990 | 0.0418619 | -8.874400 | 0e+00 |
Weighted median | -0.3475400 | 0.0679252 | -5.116511 | 3e-07 |
Penalized weighted median | -0.3584349 | 0.0698595 | -5.130796 | 3e-07 |
A plot of beta_YZ
against alpha_hat
shows that 11 of the 12 trans-pQTL instruments are inversely associated with type 2 diabetes. The _cis- pQTLs have large effects on circulating levels of the protein, but no effect on type 2 diabetes; this is this is explicable, as the cis- acting SNPs alter the splicing of ADIPOQ. Of the trans-pQTLs, the APOC1 locus is an outlier, with a more extreme ratio of beta_YZ
to alpha_hat
than the others.
## size of points inversely proportional to standard error
:= 0.5 * sum(se.thetaIV_delta) / se.thetaIV_delta]
coeffs.scores[, size.thetaIV
"ADIPOQ"
gene.query <- "type 2 diabetes"
long.phenoname <- ggplot(coeffs.scores,
p.coeffs <-aes(x=alpha_hat, y=beta_YZ, color=qtl_type)) +
geom_point(aes(size=size.thetaIV), alpha=0.8) +
scale_size(guide="none") +
ggrepel::geom_text_repel(aes(label=qtlname), force=2, size=4, fontface="italic", color="blue") +
scale_x_continuous(limits = c(0, NA), expand = expansion(mult = c(0, 0.1))) +
scale_y_continuous(expand = expansion(mult = c(0.1, 0.1))) +
geom_abline(data=estimators.dt,
aes(slope=Estimate, intercept=rep(0, 3),
linetype=Estimator)) +
theme(legend.position="top") +
theme(legend.direction="vertical") +
theme(legend.box="horizontal") +
guides(color=guide_legend(title="QTL type")) +
xlab(paste("Effect of genetic instrument on ", gene.query)) +
ylab(paste("Effect on", long.phenoname)) +
coord_fixed(ratio=2)
save(p.coeffs, file="p.coeffs.ADIPOQ.RData")
p.coeffs
The function run_mrhevo.sstats
will call Stan
to sample the posterior distribution of all parameters, given the model and the data consisting of summary statistics. It returns an object of class stanfit
As recommended by Piironen and Vehtari (2017), we supply a prior guess of the fraction of instruments that have nonzero pleiotropic effects. With a large enough sample size, this prior guess will not make much difference to the results. For this example, we guess the fraction to be 0.2. This guess is used to set the scale of the prior on the global shrinkage parameter.
The causal effect parameter theta
is assigned a Gaussian prior with mean zero and standard deviation priorsd.theta
. As we will divide the posterior by the prior to obtain the likelihood as a function of theta
, the setting of this prior should not make much difference but the sampler will be more efficient if the prior standard deviation is set to a value that is supported by the data.
To speed up computation in this large dataset, the statistical model is fitted to summary statistics rather than to individual-level data.
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)
TRUE
use.sampling <- TRUE
logistic <- 1
priorsd_theta <- FALSE
newrun <-
colnames(Z) %in% Z.metadata[qtl_type=="trans", scoreid]
which.trans <- length(which.trans)
J <- nrow(Z)
N <-
coeffs.scores[qtl_type=="trans", beta_YZ]
gamma_hat <- coeffs.scores[qtl_type=="trans", SE_YZ]
se.gamma_hat <- coeffs.scores[qtl_type=="trans", alpha_hat]
alpha_hat <- coeffs.scores[qtl_type=="trans", se.alpha_hat]
se.alpha_hat <-
run_mrhevo.sstats(alpha_hat, se.alpha_hat,
hevo.stanfit <-
gamma_hat, se.gamma_hat,fraction_pleio=0.2,
priorsd_theta=priorsd_theta)
#> compiling stan model ... done
#> Sampling posterior distribution ...
#> CHECKING DATA AND PREPROCESSING FOR MODEL 'MRHevo.summarystats' NOW.
#>
#> COMPILING MODEL 'MRHevo.summarystats' NOW.
#>
#> STARTING SAMPLER FOR MODEL 'MRHevo.summarystats' NOW.
#> Warning: There were 7 divergent transitions after warmup. See
#> https://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
#> to find out why this is a problem and how to eliminate them.
#> Warning: Examine the pairs() plot to diagnose sampling problems
#> done
From the posterior samples, we compute the likelihood as a function of theta
. We fit a quadratic approximation to the log-likelihood, and use this to obtain a classical maximum likelihood and p-value.
A plot of the posterior density of theta
shows that it is approximately Gaussian, and a plot of the log-likelihood shows that it can be approximated by a quadratic function. This guarantees that the value of theta
that maximizes the likelihood will have the sampling properties – consistency, unbiasedness, and minimum variance – that are desirable for an “estimator”.
unlist(extract(hevo.stanfit, pars="theta"))
theta.samples <- dnorm(theta.samples, mean=0, sd=priorsd_theta)
prior.theta <- mle.se.pval(x=theta.samples, prior=prior.theta)
mle.theta <-:= "Marginalize over direct effects"]
mle.theta[, Estimator
mle.se.pval(x=theta.samples, prior=prior.theta, return.asplot=TRUE)
p.bayesianloglik <- p.bayesianloglik
Marginalizing over the distribution of unobserved pleiotropic effects is more conservative than the standard methods for inferring the causal effect parameter, at least in this example.
rbind(estimators.dt, mle.theta, fill=TRUE)
methods.dt <-::kable(methods.dt[, .(Estimator, Estimate, SE, z, pvalue)]) knitr
Estimator | Estimate | SE | z | pvalue |
---|---|---|---|---|
Weighted mean | -0.3714990 | 0.0418619 | -8.874400 | 0.00e+00 |
Weighted median | -0.3475400 | 0.0679252 | -5.116511 | 3.00e-07 |
Penalized weighted median | -0.3584349 | 0.0698595 | -5.130796 | 3.00e-07 |
Marginalize over direct effects | -0.3754964 | 0.0911071 | -4.121484 | 3.76e-05 |
For each instrument, we can compute from the model parameters a shrinkage coefficent kappa
that takes values between 0 (no shrinkage of the pleiotropic effect) and 1 (complete shrinkage of the effect to zero). The posterior distribution of the J
shrinkage coefficients has the horseshoe shape specified by the prior.
extract(hevo.stanfit, pars="kappa")$kappa
kappa.all <- data.table(kappa=as.numeric(kappa.all))
kappa.dt <-
ggplot(kappa.dt, aes(x=kappa)) +
p.kappa <- geom_histogram(aes(y=after_stat(density)),
breaks=seq(0, 1, by=0.05), color="black", fill="gray") +
labs(x = expression(paste("Shrinkage coefficient ", kappa, " imposed on direct effects of instruments")),
y= "Posterior density")
p.kappa
Summaries of the posterior distribution show that the pleiotropic effect beta[11]
which represents the APOC1 locus has escaped shrinkage, but the pleiotropic effects of the other four instruments have been shrunk to near zero.
if(use.sampling) {
c("theta", "beta", "log_c", "log_tau", "m_eff", "lp__")
pars.forsummary <-else {
} c("theta", "beta", "log_c", "log_tau", "m_eff")
pars.forsummary <-
}
summary(hevo.stanfit,
fit.coeffs <-pars=pars.forsummary,
probs=c(0.1, 0.9))$summary
as.data.table(round(fit.coeffs, 2), keep.rownames="variable")
fit.coeffs <-
::kable(fit.coeffs) knitr
variable | mean | se_mean | sd | 10% | 90% | n_eff | Rhat |
---|---|---|---|---|---|---|---|
theta | -0.39 | 0.00 | 0.08 | -0.49 | -0.29 | 2261.83 | 1 |
beta[1] | -0.01 | 0.00 | 0.01 | -0.01 | 0.00 | 2727.93 | 1 |
beta[2] | 0.01 | 0.00 | 0.01 | 0.00 | 0.02 | 3279.81 | 1 |
beta[3] | 0.00 | 0.00 | 0.00 | 0.00 | 0.01 | 5615.80 | 1 |
beta[4] | 0.00 | 0.00 | 0.00 | 0.00 | 0.01 | 4809.12 | 1 |
beta[5] | 0.00 | 0.00 | 0.00 | 0.00 | 0.01 | 4705.02 | 1 |
beta[6] | 0.00 | 0.00 | 0.00 | 0.00 | 0.01 | 4505.16 | 1 |
beta[7] | 0.01 | 0.00 | 0.01 | 0.00 | 0.01 | 3723.95 | 1 |
beta[8] | 0.00 | 0.00 | 0.00 | -0.01 | 0.00 | 4778.98 | 1 |
beta[9] | 0.00 | 0.00 | 0.01 | -0.01 | 0.00 | 2702.92 | 1 |
beta[10] | 0.00 | 0.00 | 0.00 | 0.00 | 0.01 | 5423.68 | 1 |
beta[11] | -0.03 | 0.00 | 0.01 | -0.04 | -0.02 | 5269.51 | 1 |
beta[12] | -0.01 | 0.00 | 0.01 | -0.01 | 0.00 | 3086.93 | 1 |
log_c | -1.13 | 0.01 | 0.63 | -1.83 | -0.30 | 3843.97 | 1 |
log_tau | -4.97 | 0.01 | 0.67 | -5.83 | -4.17 | 2714.18 | 1 |
m_eff | 5.56 | 0.02 | 1.24 | 3.95 | 7.14 | 4257.44 | 1 |
lp__ | -48.91 | 0.14 | 6.26 | -57.10 | -41.06 | 2043.08 | 1 |
A pairs plot is useful to diagnose problems with the sampler.
if(use.sampling) {
pairs(hevo.stanfit, pars=c("theta",
"log_c", "log_tau", "m_eff",
"lp__"))
}