Overview
gwastargets provides functions for building reproducible
multi-cohort GWAS meta-analysis pipelines using the targets
framework. It supports both binary (case/control) and quantitative
traits.
The typical workflow is:
- Assemble a manifest data frame describing your cohort files
- Call
generate_gwas_meta_pipeline()to produce_targets.Rcode - Paste the output into your
_targets.Rand runtargets::tar_make()
The pipeline functions (prep_gwas(),
meta_analyze_ivw(), etc.) are also available for
interactive use outside of targets.
Required ancillary files
Two reference files must be provided before generating a pipeline:
HapMap 3 SNP list (hm3_path)
Used by prep_gwas() to match variants and by LDSC for
heritability and genetic correlation estimation. The file is loaded via
vroom::vroom() and should have columns SNP,
A1, A2. The standard LDSC HapMap 3 SNP list
(w_hm3.snplist) is available from the Alkes Group
LDSCORE repository.
dbSNP155 reference (dbsnp_path)
Used by clean_gwas() (via tidyGWAS) for
variant QC and rsID harmonization. This should be the path to a local
dbSNP155 directory. See the tidyGWAS
documentation for instructions on downloading and preparing this
reference.
Step 1: Build a cohort manifest
The manifest is a data frame with one row per cohort × ancestry combination. It must contain five columns:
| Column | Description |
|---|---|
path |
Full path to the raw summary statistics file |
file |
Basename of the file |
cohort |
Human-readable cohort label |
ancestry |
Ancestry label used by ldscr (e.g. "EUR",
"AFR", "EAS") |
study |
Unique identifier used as the targets target name |
library(gwastargets)
manifest <- data.frame(
path = c(
"/project/data/UKBB_EUR.txt.gz",
"/project/data/MVP_EUR.txt.gz",
"/project/data/BBJ_EAS.txt.gz"
),
file = c("UKBB_EUR.txt.gz", "MVP_EUR.txt.gz", "BBJ_EAS.txt.gz"),
cohort = c("UKBB", "MVP", "BBJ"),
ancestry = c("EUR", "EUR", "EAS"),
study = c("UKBB_EUR", "MVP_EUR", "BBJ_EAS"),
stringsAsFactors = FALSE
)Optional column-mapping columns
When a cohort’s raw summary statistics use non-standard column
headers that are not covered by the built-in dictionary (see Raw summary statistics format
below), you can add col_* columns to the manifest. Each
col_* value is the raw column name in that cohort’s file;
use NA for cohorts that don’t need an override.
Supported col_* columns: col_chr,
col_pos, col_rsid,
col_effect_allele, col_other_allele,
col_beta, col_se, col_p,
col_eaf, col_n, col_n_cases,
col_n_controls. See ?build_column_map for
details on what each maps to.
manifest <- data.frame(
path = c("/project/data/UKBB_EUR.txt.gz",
"/project/data/MVP_EUR.txt.gz"),
file = c("UKBB_EUR.txt.gz", "MVP_EUR.txt.gz"),
cohort = c("UKBB", "MVP"),
ancestry = c("EUR", "EUR"),
study = c("UKBB_EUR", "MVP_EUR"),
# UKBB uses non-standard names; MVP uses standard names
col_eaf = c("MY_FREQ", NA),
col_beta = c("BETA_VAL", NA),
stringsAsFactors = FALSE
)generate_gwas_meta_pipeline() validates the manifest
before generating any code — it will error if required columns are
missing or study values are duplicated, and warn if any
path files do not exist on disk.
Step 2: Generate the pipeline
code <- generate_gwas_meta_pipeline(
trait = "CAD",
trait_type = "binary", # or "quantitative"
n_col = "EffectiveN", # use "N" for quantitative traits
manifest_df = manifest,
hm3_path = "/path/to/w_hm3.snplist",
dbsnp_path = "/path/to/dbSNP155"
)
# Review the generated code
cat(code)The output is a self-contained block of R code ready to paste into
_targets.R. It includes:
- The manifest serialized as a
tibble::tribble()call - A
tar_map()block running per-cohort preparation and LDSC heritability - LDSC genetic correlation QC targets (one per ancestry with ≥ 2 cohorts)
- Per-ancestry IVW meta-analysis targets with Manhattan plots and loci
- All-populations IVW meta-analysis with MR-MEGA MDS
- Cohort overview table and precision plot
Step 3: Per-cohort preparation (prep_gwas)
Each cohort’s summary statistics pass through
prep_gwas(), which:
-
Cleans the raw file via
clean_gwas()(header harmonization +tidyGWASQC) -
Validates that the expected columns are present for
the declared
trait_type -
Matches variants to the HapMap 3 SNP list via
snp_match_munge() - Estimates the LDSC intercept and applies SE correction when intercept > 1
-
Writes a
.parquetfile with standardized columns includingSE_ldsc_adjusted
# Typically called inside a targets pipeline, but works interactively too
out_path <- prep_gwas(
sumstats_file = "/project/data/UKBB_EUR.txt.gz",
hm3 = hm3_snps, # data frame with SNP, A1, A2
ancestry = "EUR",
output_path = "Data/CAD/prepped_sumstats",
trait_type = "binary",
n_col = EffectiveN, # bare name, not quoted
dbsnp_path = "/path/to/dbSNP155"
)The output parquet file contains the original cleaned columns plus:
| Added column | Description |
|---|---|
intercept |
LDSC intercept estimate |
SE_ldsc_adjusted |
SE * sqrt(intercept) when intercept > 1, otherwise
SE
|
Z_ldsc_adjusted |
B / SE_ldsc_adjusted |
P_ldsc_adjusted |
Two-sided p-value from Z_ldsc_adjusted
|
ldsc_adjustment |
TRUE if intercept correction was applied |
Step 4: Cohort QC with summarize_sumstats
Before meta-analysis, inspect per-cohort statistics:
prepped_files <- list.files("Data/CAD/prepped_sumstats", full.names = TRUE,
pattern = "\\.parquet$")
summary_tbl <- summarize_sumstats(prepped_files, trait_type = "binary")For binary traits, the summary includes n_cases,
n_controls, n_effective, and
n_total. For quantitative traits, only n_total
is returned. Both include SNP counts, EAF range, and median SE — used to
build the cohort overview table and precision plot in the generated
pipeline.
Step 5: IVW meta-analysis
meta_analyze_ivw() runs a fixed-effects inverse-variance
weighted meta-analysis across all prepped parquet files, one chromosome
at a time:
meta_results <- meta_analyze_ivw(
parquet_files = prepped_files,
trait_type = "binary",
se_col = SE_ldsc_adjusted, # use LDSC-adjusted SE
chromosomes = 1:22,
min_mac = 100 # minimum minor allele count filter
)Output columns include: B, SE,
p_value, z_score, EAF,
n_contributions, N (and for binary traits:
CaseN, ControlN, EffectiveN),
plus heterogeneity statistics Q, Q_df,
Q_pval, and I2.
Step 6: Loci extraction
loci <- extract_loci(meta_results)Calls gwasRtools::get_loci() for LD-based clumping and
gwasRtools::get_nearest_gene() for gene annotation. Returns
a tibble with one row per independent locus and columns for lead
variant, gene, and association statistics.
Step 7: MR-MEGA MDS
For multi-ancestry analyses, MR-MEGA requires a matrix of per-study
effect estimates at common variants.
extract_common_variants() identifies variants present in
all studies above a MAF threshold, and
calculate_mr_mega_mds() computes MDS coordinates from the
resulting effect heterogeneity matrix:
common_vars <- extract_common_variants(
parquet_files = prepped_files,
trait_type = "binary",
select_per_mb_window = TRUE # LD-thin to ~1 variant/Mb
)
mds <- calculate_mr_mega_mds(common_vars)
# mds$pc_coordinates: study × PC matrix
# mds$distance_matrix: pairwise effect-size distance matrixThe MDS coordinates are passed as covariates to MR-MEGA to model ancestry-driven effect heterogeneity.
Raw summary statistics format
prep_gwas() accepts a wide range of column naming
conventions. harmonize_sumstats_headers() (called
internally by clean_gwas()) maps common aliases to the
standard names used throughout this package:
| Standard name | Common aliases accepted |
|---|---|
B |
BETA, Effect, b
|
SE |
StdErr, standard_error
|
P |
PVAL, P_VALUE, pvalue,
p.value
|
EAF |
EFF_ALL_FREQ, A1FREQ,
AF_Allele2
|
N |
Neff, NMISS, OBS_CT
|
CaseN |
N_CASE, Ncases
|
ControlN |
N_CONTROL, Ncontrols
|
When using generate_gwas_meta_pipeline(), additional
mappings are handled automatically via col_* columns in the
manifest (see [Step 1] above). For interactive use, pass a named
character vector to the column_map argument of
prep_gwas().