split_spec is designed as an interchange
format, not as internal plumbing for any one downstream package.
This vignette shows the full path: derive a constraint in R, serialize
it to JSON, read it in Python with the shipped splitspec
reference consumer, and hand the recovered grouping straight to a
scikit-learn resampler.
The Python chunks below are shown but not executed, so building the
vignette needs no Python. To demonstrate the central claim rather than
merely assert it, the vignette does run the shipped Python reader
through R when a python3 interpreter is available (see
“Verify the round-trip”), and shows that the grouping it recovers
matches R’s exactly.
library(splitGraph)
meta <- data.frame(
  sample_id = c("S1", "S2", "S3", "S4", "S5"),
  subject_id = c("P1", "P1", "P2", "P3", "P3"),
  timepoint_id = c("T0", "T1", "T0", "T2", "T0"),
  time_index = c(0, 1, 0, 2, 0),
  stringsAsFactors = FALSE
)
g <- graph_from_metadata(meta, graph_name = "handoff-demo")
# Group so that repeated measures of the same subject never straddle a split.
constraint <- derive_split_constraints(g, mode = "subject")
spec <- as_split_spec(constraint, graph = g)
path <- tempfile(fileext = ".json")
write_split_spec(spec, path)

The written file carries a $schema reference and a
schema_version, and can be validated against the shipped
JSON Schema before it ever leaves R.
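A consumer on the receiving side can run a cheap pre-flight check of its own before trusting a file. The sketch below is stdlib-only and rests on an assumption: that `$schema` and `schema_version` are top-level keys of the JSON document. The helper name `check_header`, the schema URL, and the sample payload are hypothetical. This is not the package's validator and does not replace full JSON Schema validation.

```python
import json
import os
import tempfile


def check_header(path, supported_versions):
    """Return the file's schema_version, or raise ValueError.

    Assumes `$schema` and `schema_version` are top-level keys (an
    assumption about the file layout, not a documented guarantee).
    """
    with open(path, encoding="utf-8") as fh:
        doc = json.load(fh)
    if not isinstance(doc, dict):
        raise ValueError("expected a JSON object at the top level")
    for key in ("$schema", "schema_version"):
        if not isinstance(doc.get(key), str):
            raise ValueError(f"missing or non-string field: {key}")
    version = doc["schema_version"]
    if version not in supported_versions:
        raise ValueError(f"unsupported schema_version: {version}")
    return version


# Hypothetical payload standing in for a real split_spec file.
sample = {"$schema": "https://example.org/split_spec.schema.json",
          "schema_version": "0.2.0"}
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "split_spec.json")
    with open(demo_path, "w", encoding="utf-8") as fh:
        json.dump(sample, fh)
    found = check_header(demo_path, supported_versions={"0.2.0"})

print(found)
```

Failing early on an unknown version keeps a reader from silently misinterpreting a file written by a newer producer.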
The reference consumer lives in the installed package under
inst/python. On the R side its location is:
system.file("python", package = "splitGraph")
#> [1] "/private/var/folders/dj/y28dp44x303ggfg6rg8n2v0h0000gn/T/Rtmp3C2BA2/Rbuildbb734cd7ed7b/splitGraph/inst/python"

Point Python at that directory (or install/copy the
splitspec package), then:
import sys
# sys.path.append(<the inst/python path printed above>)
from splitspec import load_split_spec
spec = load_split_spec("split_spec.json")
spec.schema_version # "0.2.0"
spec.constraint_mode # "subject"
spec.recommended_resampling # "grouped_cv"
# Grouping keyed by sample_id — identical to R's grouping_vector():
spec.grouping()
# {'S1': 'subject:P1', 'S2': 'subject:P1', 'S3': 'subject:P2',
# 'S4': 'subject:P3', 'S5': 'subject:P3'}
df = spec.to_frame() # pandas DataFrame of sample_data

Rather than take the comment above on faith, we can run the shipped
Python reader on the exact file we just wrote and compare what it
recovers to R’s grouping_vector(). This is what
inst/python/conformance.py does; the chunk below invokes it
through R and only runs when a python3 interpreter is
present, so the vignette still builds without Python.
script <- system.file("python", "conformance.py", package = "splitGraph")
out_path <- tempfile(fileext = ".json")
# Run the Python reader on our JSON file; it writes back what it recovered.
status <- system2(
  "python3", c("-B", shQuote(script), shQuote(path), shQuote(out_path)),
  stdout = FALSE, stderr = FALSE
)
if (status == 0 && file.exists(out_path)) {
  recovered <- jsonlite::fromJSON(out_path)
  # Grouping recovered by Python:
  print(unlist(recovered$grouping))
  # Identical to the grouping R produced?
  r_grouping <- grouping_vector(constraint)
  cat("Python matches R exactly:",
      identical(unlist(recovered$grouping)[names(r_grouping)],
                r_grouping[names(r_grouping)]), "\n")
}
#> S1 S2 S3 S4 S5
#> "subject:P1" "subject:P1" "subject:P2" "subject:P3" "subject:P3"
#> Python matches R exactly: TRUE

The same script also checks order_rank, and the
package’s test suite runs this comparison as an automated conformance
test (skipped when Python is absent, and never on CRAN). The point is
that the partition is decided once in R and only
reproduced elsewhere: Python reads it rather than recomputing it, so
there is no second implementation for the two languages to disagree over.
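The comparison itself is small. As a rough illustration (not the contents of `conformance.py`), the sketch below checks a recovered grouping against a reference grouping sample by sample and reports any disagreement. The function name `grouping_mismatches` and the inline dictionaries are hypothetical; the labels mirror the output shown above.

```python
def grouping_mismatches(reference, recovered):
    """List (sample_id, reference_group, recovered_group) for every sample
    whose group differs or that is missing on either side."""
    mismatches = []
    for sample_id in sorted(set(reference) | set(recovered)):
        ref = reference.get(sample_id)
        rec = recovered.get(sample_id)
        if ref != rec:
            mismatches.append((sample_id, ref, rec))
    return mismatches


# Reference grouping, as R reported it above.
r_grouping = {"S1": "subject:P1", "S2": "subject:P1", "S3": "subject:P2",
              "S4": "subject:P3", "S5": "subject:P3"}

# A faithful reader returns the same mapping, whatever the key order.
py_grouping = dict(reversed(list(r_grouping.items())))
clean = grouping_mismatches(r_grouping, py_grouping)

# A reader that re-derived groups and got S5 wrong would be caught.
drifted = dict(r_grouping, S5="subject:P9")
caught = grouping_mismatches(r_grouping, drifted)

print(clean)
print(caught)
```

Comparing by `sample_id` rather than by position means a difference in row order between the two languages does not register as a false mismatch.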
The grouping vector plugs directly into GroupKFold (or
StratifiedGroupKFold), guaranteeing that all samples from a
subject land in the same fold:
import numpy as np
from sklearn.model_selection import GroupKFold
groups = spec.groups() # group_id per sample, in file order
X = np.zeros((len(groups), 1)) # placeholder design matrix
for train_idx, test_idx in GroupKFold(n_splits=3).split(X, groups=groups):
    train_groups = {groups[i] for i in train_idx}
    test_groups = {groups[i] for i in test_idx}
    assert train_groups.isdisjoint(test_groups)  # no subject leaks across folds

For an ordered evaluation (a mode = "time" spec), sort
by order_rank first and use
TimeSeriesSplit.
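A minimal sketch of that ordering step follows. It assumes each sample's `order_rank` is already available as a mapping from `sample_id` to an integer. The `order_rank` values below are made up for illustration, and how the reader exposes them is not shown here; only `TimeSeriesSplit` and NumPy are real APIs in this block.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical order_rank per sample (lower = earlier); ties are allowed.
order_rank = {"S1": 1, "S2": 2, "S3": 1, "S4": 3, "S5": 1}

# Sort sample ids by rank, breaking ties by id so the order is deterministic.
ordered_ids = sorted(order_rank, key=lambda s: (order_rank[s], s))
X = np.zeros((len(ordered_ids), 1))  # placeholder design matrix, in rank order

folds = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=2).split(X):
    train_ids = [ordered_ids[i] for i in train_idx]
    test_ids = [ordered_ids[i] for i in test_idx]
    # Every training sample is ranked no later than every test sample.
    assert max(order_rank[s] for s in train_ids) <= min(order_rank[s] for s in test_ids)
    folds.append((train_ids, test_ids))

print(folds)
```

`TimeSeriesSplit` cuts purely by position, so samples with tied ranks can land on both sides of a boundary. A spec with many ties may need the tied samples kept together before splitting.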
The leakage-aware partition is decided once, in R, from
explicit and validated dependency structure — and every other language
merely reproduces it from the split_spec. Nothing
about the split logic is re-implemented in Python, so the two sides
cannot drift. split_spec is the contract; scikit-learn
(here) and rsample (on the R side) are just interchangeable
consumers of it. That is what makes it an interchange format rather than
internal plumbing for any one tool.