Getting Started with gwastargets

Overview

gwastargets provides functions for building reproducible multi-cohort GWAS meta-analysis pipelines using the targets framework. It supports both binary (case/control) and quantitative traits.

The typical workflow is:

Assemble a manifest data frame describing your cohort files
Call generate_gwas_meta_pipeline() to produce _targets.R code
Paste the output into your _targets.R and run targets::tar_make()

The pipeline functions (prep_gwas(), meta_analyze_ivw(), etc.) are also available for interactive use outside of targets.

Required ancillary files

Two reference files must be provided before generating a pipeline:

HapMap 3 SNP list (`hm3_path`)

Used by prep_gwas() to match variants and by LDSC for heritability and genetic correlation estimation. The file is loaded via vroom::vroom() and should have columns SNP, A1, A2. The standard LDSC HapMap 3 SNP list (w_hm3.snplist) is available from the Alkes Group LDSCORE repository.

dbSNP155 reference (`dbsnp_path`)

Used by clean_gwas() (via tidyGWAS) for variant QC and rsID harmonization. This should be the path to a local dbSNP155 directory. See the tidyGWAS documentation for instructions on downloading and preparing this reference.

Step 1: Build a cohort manifest

The manifest is a data frame with one row per cohort × ancestry combination. It must contain five columns:

Column	Description
`path`	Full path to the raw summary statistics file
`file`	Basename of the file
`cohort`	Human-readable cohort label
`ancestry`	Ancestry label used by `ldscr` (e.g. `"EUR"`, `"AFR"`, `"EAS"`)
`study`	Unique identifier used as the `targets` target name

library(gwastargets)

manifest <- data.frame(
  path     = c(
    "/project/data/UKBB_EUR.txt.gz",
    "/project/data/MVP_EUR.txt.gz",
    "/project/data/BBJ_EAS.txt.gz"
  ),
  file     = c("UKBB_EUR.txt.gz", "MVP_EUR.txt.gz", "BBJ_EAS.txt.gz"),
  cohort   = c("UKBB",            "MVP",            "BBJ"),
  ancestry = c("EUR",             "EUR",            "EAS"),
  study    = c("UKBB_EUR",        "MVP_EUR",        "BBJ_EAS"),
  stringsAsFactors = FALSE
)

Optional column-mapping columns

When a cohort’s raw summary statistics use non-standard column headers that are not covered by the built-in dictionary (see Raw summary statistics format below), you can add col_* columns to the manifest. Each col_* value is the raw column name in that cohort’s file; use NA for cohorts that don’t need an override.

Supported col_* columns: col_chr, col_pos, col_rsid, col_effect_allele, col_other_allele, col_beta, col_se, col_p, col_eaf, col_n, col_n_cases, col_n_controls. See ?build_column_map for details on what each maps to.

manifest <- data.frame(
  path     = c("/project/data/UKBB_EUR.txt.gz",
               "/project/data/MVP_EUR.txt.gz"),
  file     = c("UKBB_EUR.txt.gz", "MVP_EUR.txt.gz"),
  cohort   = c("UKBB",            "MVP"),
  ancestry = c("EUR",             "EUR"),
  study    = c("UKBB_EUR",        "MVP_EUR"),
  # UKBB uses non-standard names; MVP uses standard names
  col_eaf  = c("MY_FREQ",         NA),
  col_beta = c("BETA_VAL",        NA),
  stringsAsFactors = FALSE
)

generate_gwas_meta_pipeline() validates the manifest before generating any code — it will error if required columns are missing or study values are duplicated, and warn if any path files do not exist on disk.

Step 2: Generate the pipeline

code <- generate_gwas_meta_pipeline(
  trait       = "CAD",
  trait_type  = "binary",      # or "quantitative"
  n_col       = "EffectiveN",  # use "N" for quantitative traits
  manifest_df = manifest,
  hm3_path    = "/path/to/w_hm3.snplist",
  dbsnp_path  = "/path/to/dbSNP155"
)

# Review the generated code
cat(code)

The output is a self-contained block of R code ready to paste into _targets.R. It includes:

The manifest serialized as a tibble::tribble() call
A tar_map() block running per-cohort preparation and LDSC heritability
LDSC genetic correlation QC targets (one per ancestry with ≥ 2 cohorts)
Per-ancestry IVW meta-analysis targets with Manhattan plots and loci
All-populations IVW meta-analysis with MR-MEGA MDS
Cohort overview table and precision plot

Step 3: Per-cohort preparation (`prep_gwas`)

Each cohort’s summary statistics pass through prep_gwas(), which:

Cleans the raw file via clean_gwas() (header harmonization + tidyGWAS QC)
Validates that the expected columns are present for the declared trait_type
Matches variants to the HapMap 3 SNP list via snp_match_munge()
Estimates the LDSC intercept and applies SE correction when intercept > 1
Writes a .parquet file with standardized columns including SE_ldsc_adjusted

# Typically called inside a targets pipeline, but works interactively too
out_path <- prep_gwas(
  sumstats_file = "/project/data/UKBB_EUR.txt.gz",
  hm3           = hm3_snps,              # data frame with SNP, A1, A2
  ancestry      = "EUR",
  output_path   = "Data/CAD/prepped_sumstats",
  trait_type    = "binary",
  n_col         = EffectiveN,            # bare name, not quoted
  dbsnp_path    = "/path/to/dbSNP155"
)

The output parquet file contains the original cleaned columns plus:

Added column	Description
`intercept`	LDSC intercept estimate
`SE_ldsc_adjusted`	`SE * sqrt(intercept)` when intercept > 1, otherwise `SE`
`Z_ldsc_adjusted`	`B / SE_ldsc_adjusted`
`P_ldsc_adjusted`	Two-sided p-value from `Z_ldsc_adjusted`
`ldsc_adjustment`	`TRUE` if intercept correction was applied

Step 4: Cohort QC with `summarize_sumstats`

Before meta-analysis, inspect per-cohort statistics:

prepped_files <- list.files("Data/CAD/prepped_sumstats", full.names = TRUE,
                            pattern = "\\.parquet$")

summary_tbl <- summarize_sumstats(prepped_files, trait_type = "binary")

For binary traits, the summary includes n_cases, n_controls, n_effective, and n_total. For quantitative traits, only n_total is returned. Both include SNP counts, EAF range, and median SE — used to build the cohort overview table and precision plot in the generated pipeline.

Step 5: IVW meta-analysis

meta_analyze_ivw() runs a fixed-effects inverse-variance weighted meta-analysis across all prepped parquet files, one chromosome at a time:

meta_results <- meta_analyze_ivw(
  parquet_files = prepped_files,
  trait_type    = "binary",
  se_col        = SE_ldsc_adjusted,   # use LDSC-adjusted SE
  chromosomes   = 1:22,
  min_mac       = 100                  # minimum minor allele count filter
)

Output columns include: B, SE, p_value, z_score, EAF, n_contributions, N (and for binary traits: CaseN, ControlN, EffectiveN), plus heterogeneity statistics Q, Q_df, Q_pval, and I2.

Step 6: Loci extraction

loci <- extract_loci(meta_results)

Calls gwasRtools::get_loci() for LD-based clumping and gwasRtools::get_nearest_gene() for gene annotation. Returns a tibble with one row per independent locus and columns for lead variant, gene, and association statistics.

Step 7: MR-MEGA MDS

For multi-ancestry analyses, MR-MEGA requires a matrix of per-study effect estimates at common variants. extract_common_variants() identifies variants present in all studies above a MAF threshold, and calculate_mr_mega_mds() computes MDS coordinates from the resulting effect heterogeneity matrix:

common_vars <- extract_common_variants(
  parquet_files        = prepped_files,
  trait_type           = "binary",
  select_per_mb_window = TRUE   # LD-thin to ~1 variant/Mb
)

mds <- calculate_mr_mega_mds(common_vars)
# mds$pc_coordinates: study × PC matrix
# mds$distance_matrix: pairwise effect-size distance matrix

The MDS coordinates are passed as covariates to MR-MEGA to model ancestry-driven effect heterogeneity.

Raw summary statistics format

prep_gwas() accepts a wide range of column naming conventions. harmonize_sumstats_headers() (called internally by clean_gwas()) maps common aliases to the standard names used throughout this package:

Standard name	Common aliases accepted
`B`	`BETA`, `Effect`, `b`
`SE`	`StdErr`, `standard_error`
`P`	`PVAL`, `P_VALUE`, `pvalue`, `p.value`
`EAF`	`EFF_ALL_FREQ`, `A1FREQ`, `AF_Allele2`
`N`	`Neff`, `NMISS`, `OBS_CT`
`CaseN`	`N_CASE`, `Ncases`
`ControlN`	`N_CONTROL`, `Ncontrols`

When using generate_gwas_meta_pipeline(), additional mappings are handled automatically via col_* columns in the manifest (see [Step 1] above). For interactive use, pass a named character vector to the column_map argument of prep_gwas().