Cross-language handoff: R to JSON to Python to scikit-learn

Selçuk Korkmaz

2026-07-03

split_spec is designed as an interchange format, not as internal plumbing for any one downstream package. This vignette shows the full path: derive a constraint in R, serialize it to JSON, read it in Python with the shipped splitspec reference consumer, and hand the recovered grouping straight to a scikit-learn resampler.

The Python chunks below are shown but not executed, so building the vignette needs no Python. To keep the central claim honest rather than asserted, the vignette does run the shipped Python reader through R when a python3 interpreter is available (see “Verify the round-trip”), and shows that the grouping it recovers matches R’s exactly.

Derive and serialize in R

library(splitGraph)

meta <- data.frame(
  sample_id    = c("S1", "S2", "S3", "S4", "S5"),
  subject_id   = c("P1", "P1", "P2", "P3", "P3"),
  timepoint_id = c("T0", "T1", "T0", "T2", "T0"),
  time_index   = c(0, 1, 0, 2, 0),
  stringsAsFactors = FALSE
)

g <- graph_from_metadata(meta, graph_name = "handoff-demo")

# Group so that repeated measures of the same subject never straddle a split.
constraint <- derive_split_constraints(g, mode = "subject")
spec <- as_split_spec(constraint, graph = g)

path <- tempfile(fileext = ".json")
write_split_spec(spec, path)

The written file carries a $schema reference and a schema_version, and can be validated against the shipped JSON Schema before it ever leaves R:

report <- validate_split_spec_json(path)
report$valid
#> [1] TRUE

# The R-side grouping we expect Python to reproduce:
grouping_vector(constraint)
#>           S1           S2           S3           S4           S5 
#> "subject:P1" "subject:P1" "subject:P2" "subject:P3" "subject:P3"

Read in Python

The reference consumer lives in the installed package under inst/python. On the R side its location is:

system.file("python", package = "splitGraph")
#> [1] "/private/var/folders/dj/y28dp44x303ggfg6rg8n2v0h0000gn/T/Rtmp3C2BA2/Rbuildbb734cd7ed7b/splitGraph/inst/python"

Point Python at that directory (or install/copy the splitspec package), then:

import sys
# sys.path.append(<the inst/python path printed above>)
from splitspec import load_split_spec

# "split_spec.json" stands for the file that write_split_spec() wrote in R.
spec = load_split_spec("split_spec.json")

spec.schema_version        # "0.2.0"
spec.constraint_mode       # "subject"
spec.recommended_resampling  # "grouped_cv"

# Grouping keyed by sample_id — identical to R's grouping_vector():
spec.grouping()
# {'S1': 'subject:P1', 'S2': 'subject:P1', 'S3': 'subject:P2',
#  'S4': 'subject:P3', 'S5': 'subject:P3'}

df = spec.to_frame()       # pandas DataFrame of sample_data
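A feature matrix built elsewhere is not necessarily in file order, so it can help to align the sample_id-keyed grouping to your own row order before passing it on. The sketch below uses only the standard library and the grouping printed above. The helper name align_groups and the row order are hypothetical, not part of splitspec.

```python
# Grouping as printed in the vignette, keyed by sample_id.
grouping = {
    "S1": "subject:P1", "S2": "subject:P1", "S3": "subject:P2",
    "S4": "subject:P3", "S5": "subject:P3",
}


def align_groups(grouping, sample_ids):
    """Return one group label per entry of sample_ids, in that order.

    Hypothetical helper: fails loudly if a row has no entry in the spec,
    rather than silently dropping or mislabelling it.
    """
    missing = [s for s in sample_ids if s not in grouping]
    if missing:
        raise KeyError(f"sample_ids absent from the split_spec: {missing}")
    return [grouping[s] for s in sample_ids]


# Hypothetical row order of a feature matrix assembled outside the spec.
row_order = ["S3", "S1", "S5", "S2", "S4"]
aligned = align_groups(grouping, row_order)
print(aligned)
# ['subject:P2', 'subject:P1', 'subject:P3', 'subject:P1', 'subject:P3']
```

Raising on unknown sample IDs is a deliberate choice here: a row with no group label cannot be placed safely on either side of a split.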

Verify the round-trip

Rather than take the comment above on faith, we can run the shipped Python reader on the exact file we just wrote and compare what it recovers to R’s grouping_vector(). This is what inst/python/conformance.py does; the chunk below invokes it through R and only runs when a python3 interpreter is present, so the vignette still builds without Python.

script   <- system.file("python", "conformance.py", package = "splitGraph")
out_path <- tempfile(fileext = ".json")

# Run the Python reader on our JSON file; it writes back what it recovered.
status <- system2(
  "python3", c("-B", shQuote(script), shQuote(path), shQuote(out_path)),
  stdout = FALSE, stderr = FALSE
)

if (status == 0 && file.exists(out_path)) {
  recovered <- jsonlite::fromJSON(out_path)

  # Grouping recovered by Python:
  print(unlist(recovered$grouping))

  # Identical to the grouping R produced?
  r_grouping <- grouping_vector(constraint)
  cat("Python matches R exactly:",
      identical(unlist(recovered$grouping)[names(r_grouping)],
                r_grouping[names(r_grouping)]), "\n")
}
#>           S1           S2           S3           S4           S5 
#> "subject:P1" "subject:P1" "subject:P2" "subject:P3" "subject:P3" 
#> Python matches R exactly: TRUE

The same script also checks order_rank, and the package’s test suite runs this comparison as an automated conformance test (skipped when Python is absent, and never on CRAN). The point is that the partition is decided once in R and only reproduced elsewhere, so Python has no split logic of its own that could disagree with R.
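When a comparison like this fails, a bare TRUE/FALSE says little about where the two sides diverge. The following is a minimal sketch of a more descriptive check on the Python side. It is not the contents of conformance.py, and the function names diff_groupings and groupings_match are hypothetical.

```python
def diff_groupings(expected, recovered):
    """Compare two sample_id -> group_id mappings and describe any differences."""
    shared = set(expected) & set(recovered)
    return {
        # sample_ids the reader failed to recover
        "missing": sorted(set(expected) - set(recovered)),
        # sample_ids the reader produced but the source never had
        "extra": sorted(set(recovered) - set(expected)),
        # sample_ids present on both sides with different group labels
        "mismatched": {
            s: (expected[s], recovered[s])
            for s in sorted(shared)
            if expected[s] != recovered[s]
        },
    }


def groupings_match(expected, recovered):
    """True only if both mappings have the same keys and the same labels."""
    return not any(diff_groupings(expected, recovered).values())


# The grouping R printed above, and an identical copy standing in for Python's.
r_side = {
    "S1": "subject:P1", "S2": "subject:P1", "S3": "subject:P2",
    "S4": "subject:P3", "S5": "subject:P3",
}
py_side = dict(r_side)
print(groupings_match(r_side, py_side))  # True

# A deliberately corrupted copy, to show what a failure report looks like.
tampered = {k: v for k, v in r_side.items() if k != "S5"}
tampered["S4"] = "subject:P2"
print(diff_groupings(r_side, tampered))
# {'missing': ['S5'], 'extra': [], 'mismatched': {'S4': ('subject:P3', 'subject:P2')}}
```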

Drive scikit-learn

The grouping vector plugs directly into GroupKFold (or StratifiedGroupKFold, which additionally needs class labels y) as the groups argument, so that all samples from a subject fall on the same side of every split:

import numpy as np
from sklearn.model_selection import GroupKFold

groups = spec.groups()          # group_id per sample, in file order
X = np.zeros((len(groups), 1))  # placeholder design matrix

for train_idx, test_idx in GroupKFold(n_splits=3).split(X, groups=groups):
    train_groups = {groups[i] for i in train_idx}
    test_groups  = {groups[i] for i in test_idx}
    assert train_groups.isdisjoint(test_groups)  # no subject leaks across
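The disjointness assertion in the loop above can be factored into a small check that works on the output of any splitter, not only GroupKFold. This is a sketch using only the standard library, with hand-written index pairs standing in for a splitter's output. The name leaked_groups is hypothetical.

```python
def leaked_groups(groups, train_idx, test_idx):
    """Return the group labels that appear on both sides of one split."""
    train = {groups[i] for i in train_idx}
    test = {groups[i] for i in test_idx}
    return train & test


# Group label per sample, in file order (S1..S5), as in the vignette.
groups = ["subject:P1", "subject:P1", "subject:P2", "subject:P3", "subject:P3"]

# A split that keeps every subject on one side.
clean = ([0, 1, 2], [3, 4])
# A split that separates S1 from S2 and S4 from S5.
leaky = ([0, 2, 3], [1, 4])

print(leaked_groups(groups, *clean))          # set()
print(sorted(leaked_groups(groups, *leaky)))  # ['subject:P1', 'subject:P3']
```

Returning the offending labels rather than a boolean makes it easier to see which subjects were split when a custom or third-party resampler ignores the groups.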

For an ordered evaluation (a mode = "time" spec), sort by order_rank first and use TimeSeriesSplit, which splits by row position and therefore assumes the rows are already in temporal order:

from sklearn.model_selection import TimeSeriesSplit

order = spec.ordered_index()    # row indices sorted by order_rank
df_ordered = spec.to_frame().iloc[order].reset_index(drop=True)

for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(df_ordered):
    # fit on df_ordered.iloc[train_idx], evaluate on df_ordered.iloc[test_idx]
    ...
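Sorting by rank and splitting by position interact in one way worth checking: samples that share a rank can end up on both sides of a split boundary. The sketch below illustrates this with the standard library only. It assumes, purely for illustration, that the ranks equal the time_index column of the R metadata above; the real values come from the spec's order_rank, and respects_order is a hypothetical helper.

```python
# Stand-in ranks (an assumption): the time_index column from the R metadata.
order_rank = [0, 1, 0, 2, 0]

# Row indices sorted by rank; sorted() is stable, so ties keep file order.
order = sorted(range(len(order_rank)), key=order_rank.__getitem__)
ranks_sorted = [order_rank[i] for i in order]


def respects_order(ranks, train_idx, test_idx, strict=False):
    """Check that no training row is ranked later than any test row.

    With strict=True, a tie across the boundary also counts as a violation.
    """
    latest_train = max(ranks[i] for i in train_idx)
    earliest_test = min(ranks[i] for i in test_idx)
    if strict:
        return latest_train < earliest_test
    return latest_train <= earliest_test


print(order)         # [0, 2, 4, 1, 3]
print(ranks_sorted)  # [0, 0, 0, 1, 2]

# Positions refer to the sorted rows, as a positional splitter would see them.
print(respects_order(ranks_sorted, [0, 1, 2], [3]))           # True
print(respects_order(ranks_sorted, [0, 1], [2]))              # True
print(respects_order(ranks_sorted, [0, 1], [2], strict=True)) # False
```

Whether a tie across the boundary matters depends on the study design; the strict flag only makes the situation visible.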

Why this matters

The leakage-aware partition is decided once, in R, from explicit and validated dependency structure, and every other language merely reproduces it from the split_spec. Nothing about the split logic is re-implemented in Python, so there is no second implementation that could drift away from the first. split_spec is the contract; scikit-learn (here) and rsample (on the R side) are just interchangeable consumers of it. That is what makes it an interchange format rather than internal plumbing for any one tool.