Adapter cookbook: from split_spec to native resamples

Selçuk Korkmaz

2026-07-03

What this vignette is for

splitGraph ends at a split_spec object. It deliberately knows nothing about rsample, tidymodels, or any other resampling engine. The handoff contract is the sample_data table inside the spec plus a few scalar fields (group_var, block_vars, time_var, ordering_required, recommended_resampling), together with provenance the adapter can inspect to choose a strategy (constraint_mode, constraint_strategy).

You do not always have to write this glue yourself. The reference downstream consumer, bioLeak, takes a split_spec directly — bioLeak::as_leaksplits(spec, data, outcome) builds an executable, leakage-audited split plan from it. This cookbook is for the other case: when you want to feed a split_spec into a different engine, or understand exactly what a consumer has to honor. It shows three small, self-contained adapters that turn a split_spec into something a downstream workflow can use:

  1. A base-R adapter that returns a list of (train, assess) row-index pairs; runnable here, with no extra dependencies.
  2. An rsample::group_vfold_cv() adapter for grouped cross-validation keyed to group_id.
  3. An rsample::rolling_origin() adapter for ordered evaluation keyed to order_rank.

Adapters 2 and 3 show idiomatic glue but are not evaluated in this vignette so that splitGraph does not pick up rsample as a build-time dependency.

The same pattern works for any other resampling library you happen to use.

Build a split_spec to work with

meta <- data.frame(
  sample_id    = c("S1", "S2", "S3", "S4", "S5", "S6"),
  subject_id   = c("P1", "P1", "P2", "P2", "P3", "P3"),
  batch_id     = c("B1", "B2", "B1", "B2", "B1", "B2"),
  timepoint_id = c("T0", "T1", "T0", "T1", "T0", "T1"),
  time_index   = c(0, 1, 0, 1, 0, 1),
  outcome_id   = c("ctrl", "case", "ctrl", "case", "case", "ctrl"),
  stringsAsFactors = FALSE
)

g <- graph_from_metadata(meta, graph_name = "cookbook")
subject_constraint <- derive_split_constraints(g, mode = "subject")
spec <- as_split_spec(subject_constraint, graph = g)
spec
#> <split_spec> subject 
#>   Samples: 6 
#>   Groups: 3 
#>   Recommended resampling: grouped_cv

The sample_data table is the contract:

as.data.frame(spec)[, c("sample_id", "group_id", "batch_group", "order_rank")]
#>   sample_id   group_id batch_group order_rank
#> 1        S1 subject:P1          B1          1
#> 2        S2 subject:P1          B2          2
#> 3        S3 subject:P2          B1          1
#> 4        S4 subject:P2          B2          2
#> 5        S5 subject:P3          B1          1
#> 6        S6 subject:P3          B2          2

Adapter 1 — base R: leave-one-group-out folds

This is the simplest meaningful adapter. It groups by whatever split_spec$group_var says is the split unit, and returns one held-out group per fold.

logo_folds <- function(spec, observation_data, sample_id_col = "sample_id") {
  stopifnot(inherits(spec, "split_spec"))
  if (!sample_id_col %in% names(observation_data)) {
    stop("`observation_data` must contain a `", sample_id_col, "` column.")
  }

  # Look up each observation's group by sample_id. match() keeps the row
  # order of `observation_data`, so the returned indices subset it directly
  # (merge(..., sort = FALSE) leaves the row order unspecified).
  idx <- match(observation_data[[sample_id_col]], spec$sample_data$sample_id)
  group <- spec$sample_data[[spec$group_var]][idx]

  # Rows whose sample_id is not in the spec get an NA group; split() drops them.
  groups <- split(seq_along(group), group)

  # One fold per group: that group is held out, all others form the train set.
  lapply(names(groups), function(g) {
    list(
      group  = g,
      train  = unlist(groups[setdiff(names(groups), g)], use.names = FALSE),
      assess = groups[[g]]
    )
  })
}

# Pretend we have an observation frame keyed by sample_id.
obs <- data.frame(
  sample_id = meta$sample_id,
  x = rnorm(nrow(meta)),
  y = rbinom(nrow(meta), 1, 0.5)
)

folds <- logo_folds(spec, obs)
length(folds)
#> [1] 3
folds[[1]]
#> $group
#> [1] "subject:P1"
#> 
#> $train
#> [1] 3 4 5 6
#> 
#> $assess
#> [1] 1 2

That is the entire downstream contract: take spec, take an observation frame, return train/assess index lists. Anything more complicated is specific to a resampling library.
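
A quick way to confirm that an adapter honors this contract is to check that no group lands on both sides of a fold. The sketch below uses base R only. check_no_group_leak() is a hypothetical helper, not part of splitGraph, and the toy group vector stands in for the group_id column of a real spec.

```r
# Hypothetical helper (not part of splitGraph): verify that no group
# appears in both the train and assess indices of any fold.
# `folds` is a list of lists with integer `train` and `assess` elements;
# `group` gives the group label of each observation row.
check_no_group_leak <- function(folds, group) {
  leaks <- vapply(folds, function(f) {
    length(intersect(group[f$train], group[f$assess])) > 0
  }, logical(1))
  if (any(leaks)) {
    stop("Group leakage in fold(s): ", paste(which(leaks), collapse = ", "))
  }
  invisible(TRUE)
}

# Toy stand-in for the group_id column of a split_spec.
group <- c("subject:P1", "subject:P1", "subject:P2",
           "subject:P2", "subject:P3", "subject:P3")

good_folds <- list(list(train = 3:6, assess = 1:2))
bad_folds  <- list(list(train = 2:6, assess = 1L))  # P1 sits on both sides

check_no_group_leak(good_folds, group)

# The leaky fold is rejected with an error.
leak_caught <- inherits(
  try(check_no_group_leak(bad_folds, group), silent = TRUE),
  "try-error"
)
leak_caught
```

Running a check like this on the output of any adapter, including one built on a third-party resampler, is cheap insurance that the grouping survived the handoff.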

Honoring block variables

group_var is the primary split unit, but split_spec also advertises coarser block_vars — dependency axes that should ideally not straddle a fold even when they are not the grouping unit. They are per-sample columns aligned to sample_id, so an adapter reads them exactly like group_var:

spec$block_vars
#> [1] "batch_group"
head(spec$sample_data[, c("sample_id", spec$group_var, spec$block_vars)])
#>   sample_id   group_id batch_group
#> 1        S1 subject:P1          B1
#> 2        S2 subject:P1          B2
#> 3        S3 subject:P2          B1
#> 4        S4 subject:P2          B2
#> 5        S5 subject:P3          B1
#> 6        S6 subject:P3          B2

A block-aware adapter can pass these to a resampler’s blocking/strata argument, or simply audit its folds. Here we check whether any batch straddles the train/assess boundary — a leak a subject-only split does not prevent, and exactly what carrying batch_group on the spec lets a consumer catch:

block <- spec$block_vars[[1]]
block_of <- setNames(spec$sample_data[[block]], spec$sample_data$sample_id)

do.call(rbind, lapply(folds, function(f) {
  data.frame(
    held_out_group     = f$group,
    straddling_batches = paste(
      intersect(block_of[obs$sample_id[f$train]],
                block_of[obs$sample_id[f$assess]]),
      collapse = ", "
    )
  )
}))
#>   held_out_group straddling_batches
#> 1     subject:P1             B1, B2
#> 2     subject:P2             B1, B2
#> 3     subject:P3             B1, B2

Every batch appears on both sides, because grouping by subject does not also block by batch. Whether that matters is a scientific decision — the point is that the spec carries enough information for the adapter to make it.
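
If the decision is that shared batches do matter, one possible response is to purge training rows whose block value also occurs in the assessment set. This is a minimal sketch of that one option, not a recommendation. purge_shared_blocks() is a hypothetical helper, and the toy block vector is made up so that only one batch is shared.

```r
# Hypothetical helper (not part of splitGraph): drop training rows whose
# block value also occurs in the assessment set of the same fold.
# `block` gives the block label (e.g. batch) of each observation row.
purge_shared_blocks <- function(fold, block) {
  shared <- block[fold$train] %in% block[fold$assess]
  fold$train <- fold$train[!shared]
  fold
}

# Toy data: three batches, only B1 is shared between train and assess.
block <- c("B1", "B1", "B2", "B2", "B3", "B3")
fold  <- list(train = 2:6, assess = 1L)

purged <- purge_shared_blocks(fold, block)
purged$train
#> [1] 3 4 5 6
```

On the cookbook's own toy data every batch straddles every fold, so purging would empty the training set. That trade-off is why the choice stays with the analyst rather than the spec.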

Adapter 2 — rsample::group_vfold_cv()

Grouped CV keyed to group_id. The downstream package would typically ship something like this; the adapter is short enough that you can paste it into your own analysis script.

spec_to_group_vfold <- function(spec, observation_data,
                                v = NULL,
                                sample_id_col = "sample_id") {
  stopifnot(inherits(spec, "split_spec"))
  if (!requireNamespace("rsample", quietly = TRUE)) {
    stop("Install rsample to use this adapter.")
  }

  joined <- merge(
    observation_data,
    spec$sample_data[, c("sample_id", spec$group_var)],
    by.x = sample_id_col, by.y = "sample_id", sort = FALSE
  )

  n_groups <- length(unique(joined[[spec$group_var]]))
  if (is.null(v)) v <- n_groups

  rsample::group_vfold_cv(
    data  = joined,
    group = !!spec$group_var,
    v     = v
  )
}

v = NULL (the default above) gives leave-one-group-out, which is the right default when splitGraph has already grouped samples by their deepest leakage-relevant unit (e.g. subject). Pick a smaller v for k-fold-style grouped CV.
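
To see what a smaller v means without installing rsample, the sketch below assigns whole groups to v folds in base R. grouped_kfold() is a hypothetical helper, not part of splitGraph or rsample, and it deals groups out round-robin rather than at random to keep the result deterministic.

```r
# Hypothetical helper (not part of splitGraph or rsample): assign whole
# groups to `v` folds so that no group is split across folds.
grouped_kfold <- function(group, v) {
  group <- as.character(group)
  ids <- unique(group)
  stopifnot(v >= 2, v <= length(ids))
  # Deal groups into folds round-robin; shuffle `ids` first for random folds.
  fold_of_group <- setNames(rep_len(seq_len(v), length(ids)), ids)
  fold_of_row <- unname(fold_of_group[group])
  lapply(seq_len(v), function(k) {
    list(train = which(fold_of_row != k), assess = which(fold_of_row == k))
  })
}

group <- c("P1", "P1", "P2", "P2", "P3", "P3", "P4", "P4")
folds <- grouped_kfold(group, v = 2)
folds[[1]]$assess
#> [1] 1 2 5 6
```

With v equal to the number of groups this reduces to leave-one-group-out, which matches the default of the adapter above.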

Adapter 3 — rsample::rolling_origin()

When spec$ordering_required is TRUE (or spec$time_var is set), the right downstream object is an ordered split rather than a grouped one.

spec_to_rolling_origin <- function(spec, observation_data,
                                   sample_id_col = "sample_id",
                                   initial = NULL,
                                   assess = 1L) {
  stopifnot(inherits(spec, "split_spec"))
  if (is.null(spec$time_var)) {
    stop("This split_spec has no `time_var`; ordered evaluation is not available.")
  }
  if (!requireNamespace("rsample", quietly = TRUE)) {
    stop("Install rsample to use this adapter.")
  }

  joined <- merge(
    observation_data,
    spec$sample_data[, c("sample_id", spec$time_var)],
    by.x = sample_id_col, by.y = "sample_id", sort = FALSE
  )
  ordered <- joined[order(joined[[spec$time_var]]), , drop = FALSE]

  if (is.null(initial)) initial <- max(1L, floor(nrow(ordered) * 0.6))
  rsample::rolling_origin(ordered, initial = initial, assess = assess)
}

The key idea: splitGraph puts ordering information on the spec; the adapter is just a thin shim that consumes it.
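
The same idea can be sketched without rsample: sort rows by the ordering column, then grow the training window one step at a time. rolling_folds() is a hypothetical base-R helper, not part of splitGraph or rsample, and it is not claimed to reproduce rsample::rolling_origin() exactly. The order_rank vector is toy data.

```r
# Hypothetical helper (not part of splitGraph or rsample): expanding-window
# folds over rows sorted by an ordering column such as order_rank.
rolling_folds <- function(order_rank, initial, assess = 1L) {
  ord <- order(order_rank)            # row indices in temporal order
  n <- length(ord)
  stopifnot(initial >= 1, initial + assess <= n)
  ends <- seq(initial, n - assess)    # last training position of each fold
  lapply(ends, function(end) {
    list(
      train  = ord[seq_len(end)],
      assess = ord[(end + 1):(end + assess)]
    )
  })
}

order_rank <- c(3, 1, 2, 5, 4)
folds <- rolling_folds(order_rank, initial = 3)
folds[[1]]
#> $train
#> [1] 2 3 1
#> 
#> $assess
#> [1] 5
```

Ties in the ordering column are broken by row position in this sketch. Whether tied samples may sit on both sides of the boundary is a decision for the analyst.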

Going across language boundaries via JSON

If the downstream consumer is not in R, write the spec to JSON and let the consumer interpret it. The on-disk format is a formal, versioned contract: it has a JSON Schema (Draft 2020-12) shipped in inst/schema/, each file names it via a $schema key, and validate_split_spec_json() checks a file against it before you consume it.

tmp <- tempfile(fileext = ".json")
write_split_spec(spec, tmp)

# The file opens with its $schema reference and schema_version.
cat(readLines(tmp, n = 5), sep = "\n")
#> {
#>   "$schema": "https://raw.githubusercontent.com/selcukorkmaz/splitGraph/main/inst/schema/split_spec.schema.json",
#>   "splitGraph_object": "split_spec",
#>   "schema_version": "0.2.0",
#>   "group_var": "group_id",

# Validate the file against the shipped JSON Schema, then read it back exactly.
validate_split_spec_json(tmp)$valid
#> [1] TRUE
spec2 <- read_split_spec(tmp)
identical(spec$sample_data$group_id, spec2$sample_data$group_id)
#> [1] TRUE

unlink(tmp)

You do not have to write a JSON parser to consume this from Python: the package ships a pure-Python reference reader (inst/python/splitspec) that recovers the same grouping and ordering and drives scikit-learn GroupKFold / TimeSeriesSplit. The cross-language-handoff vignette walks the full R → JSON → Python → scikit-learn path:

vignette("cross-language-handoff", package = "splitGraph")

The same read/write pair exists for dependency_graph (write_dependency_graph() / read_dependency_graph(), validated with validate_graph_json()). Both formats are documented under ?write_split_spec and ?write_dependency_graph. Because schema_version follows a documented major-compatibility policy, a file written by an older splitGraph still loads; migrate_split_spec_json() upgrades it to the current version in place.

Letting the spec pick the adapter

The three adapters above cover different shapes of split. You do not have to choose between them by hand: split_spec carries recommended_resampling, so a single dispatcher can route each spec to the right one, letting one pipeline handle subject, batch, time, and composite specs uniformly. The helper below only names the adapter it would pick; a real dispatcher would call that adapter.

recommend_adapter <- function(spec) {
  switch(
    spec$recommended_resampling,
    grouped_cv          = "group_vfold_cv (group = group_id)",
    blocked_cv          = "group_vfold_cv (group = group_id)",
    custom_grouped_cv   = "group_vfold_cv (group = group_id)",
    leave_one_group_out = "leave-one-group-out over group_id",
    ordered_split       = "rolling_origin (order by order_rank)",
    "group_vfold_cv (default)"
  )
}

# The subject spec recommends grouped CV; a time-mode spec recommends ordering.
recommend_adapter(spec)
#> [1] "group_vfold_cv (group = group_id)"
time_spec <- as_split_spec(derive_split_constraints(g, mode = "time"), graph = g)
recommend_adapter(time_spec)
#> [1] "rolling_origin (order by order_rank)"

recommended_resampling is only a hint — your adapter is free to override it — but it lets one entry point serve every constraint mode without inspecting the graph.
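
recommend_adapter() above returns a label rather than calling anything. A dispatcher that actually invokes an adapter could look like the sketch below. The three adapter_* functions are stand-ins so the example runs on its own, dispatch_adapter() and mock_spec() are hypothetical names, and the mock object carries only the one field the dispatcher reads.

```r
# Stand-in adapters so the sketch runs on its own; in practice these would
# be logo_folds(), spec_to_group_vfold() and spec_to_rolling_origin().
adapter_logo    <- function(spec, data) "leave-one-group-out"
adapter_grouped <- function(spec, data) "grouped CV"
adapter_ordered <- function(spec, data) "rolling origin"

# Hypothetical dispatcher: route on spec$recommended_resampling.
dispatch_adapter <- function(spec, data) {
  adapter <- switch(
    spec$recommended_resampling,
    leave_one_group_out = adapter_logo,
    ordered_split       = adapter_ordered,
    adapter_grouped     # grouped_cv, blocked_cv, custom_grouped_cv, unknown
  )
  adapter(spec, data)
}

# Minimal mock of a split_spec (a real one comes from as_split_spec()).
mock_spec <- function(rec) {
  structure(list(recommended_resampling = rec), class = "split_spec")
}

dispatch_adapter(mock_spec("ordered_split"), data = NULL)
#> [1] "rolling origin"
dispatch_adapter(mock_spec("grouped_cv"), data = NULL)
#> [1] "grouped CV"
```

Falling through to the grouped adapter for unknown values mirrors the default branch of recommend_adapter(); a stricter pipeline might stop with an error instead.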

When you need a custom adapter

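The only assumptions an adapter has to honor are the ones the adapters above rely on:

  1. sample_data is keyed by sample_id; join your observation data to it on that column.
  2. group_var names the column holding the split unit; samples that share a value stay on the same side of every fold.
  3. block_vars name per-sample columns for coarser dependency axes that should ideally not straddle a fold.
  4. time_var and ordering_required signal that evaluation must respect order.
  5. recommended_resampling is a hint that the adapter may override.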

That is the whole interface, and it is stable: bioLeak::as_leaksplits() consumes exactly these fields, and a contract test in splitGraph pins the seam so it cannot drift. As long as those fields are honored, anything is a valid downstream consumer.
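
One way to make these assumptions executable in a custom adapter is a small pre-flight check on the spec. The sketch below is illustrative only. check_adapter_contract() is a hypothetical helper, not part of splitGraph; the uniqueness check on sample_id is an assumption, and the mock object carries only the fields the check reads.

```r
# Hypothetical helper (not part of splitGraph): check that an object exposes
# the columns this cookbook's adapters rely on.
check_adapter_contract <- function(spec) {
  stopifnot(inherits(spec, "split_spec"), is.data.frame(spec$sample_data))
  sd <- spec$sample_data
  # NULL fields (e.g. an unset time_var) simply drop out of c().
  needed <- c("sample_id", spec$group_var, spec$block_vars, spec$time_var)
  missing_cols <- setdiff(needed, names(sd))
  if (length(missing_cols) > 0) {
    stop("sample_data lacks column(s): ", paste(missing_cols, collapse = ", "))
  }
  # Assumption: one row per sample, so sample_id works as a join key.
  if (anyDuplicated(sd$sample_id) > 0) stop("sample_id must be unique.")
  invisible(TRUE)
}

# Minimal mock of a split_spec (a real one comes from as_split_spec()).
mock <- structure(
  list(
    sample_data = data.frame(
      sample_id   = c("S1", "S2"),
      group_id    = c("subject:P1", "subject:P2"),
      batch_group = c("B1", "B2")
    ),
    group_var  = "group_id",
    block_vars = "batch_group",
    time_var   = NULL
  ),
  class = "split_spec"
)

check_adapter_contract(mock)

# A spec whose group_var points at a missing column is rejected.
broken <- mock
broken$group_var <- "no_such_column"
rejected <- inherits(try(check_adapter_contract(broken), silent = TRUE), "try-error")
rejected
```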