---
title: "Cross-language handoff: R to JSON to Python to scikit-learn"
author: "Selçuk Korkmaz"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    toc: true
vignette: >
  %\VignetteIndexEntry{Cross-language handoff: R to JSON to Python to scikit-learn}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  message = FALSE,
  warning = FALSE
)
```

`split_spec` is designed as an *interchange format*, not as internal plumbing
for any one downstream package. This vignette shows the full path: derive a
constraint in R, serialize it to JSON, read it in Python with the shipped
`splitspec` reference consumer, and hand the recovered grouping straight to a
scikit-learn resampler.

The Python chunks below are shown but not executed, so building the vignette
needs no Python. To keep the central claim honest rather than asserted, the
vignette *does* run the shipped Python reader through R when a `python3`
interpreter is available (see "Verify the round-trip"), and shows that the
grouping it recovers matches R's exactly.

# Derive and serialize in R

```{r derive}
library(splitGraph)

meta <- data.frame(
  sample_id    = c("S1", "S2", "S3", "S4", "S5"),
  subject_id   = c("P1", "P1", "P2", "P3", "P3"),
  timepoint_id = c("T0", "T1", "T0", "T2", "T0"),
  time_index   = c(0, 1, 0, 2, 0),
  stringsAsFactors = FALSE
)

g <- graph_from_metadata(meta, graph_name = "handoff-demo")

# Group so that repeated measures of the same subject never straddle a split.
constraint <- derive_split_constraints(g, mode = "subject")
spec <- as_split_spec(constraint, graph = g)

path <- tempfile(fileext = ".json")
write_split_spec(spec, path)
```

The written file carries a `$schema` reference and a `schema_version`, and can
be validated against the shipped JSON Schema before it ever leaves R:

```{r validate}
report <- validate_split_spec_json(path)
report$valid

# The R-side grouping we expect Python to reproduce:
grouping_vector(constraint)
```

# Read in Python

The reference consumer lives in the installed package under `inst/python`. On
the R side its location is:

```{r pypath}
system.file("python", package = "splitGraph")
```

Point Python at that directory (or install/copy the `splitspec` package), then:

```python
import sys
# sys.path.append()
from splitspec import load_split_spec

spec = load_split_spec("split_spec.json")

spec.schema_version          # "0.2.0"
spec.constraint_mode         # "subject"
spec.recommended_resampling  # "grouped_cv"

# Grouping keyed by sample_id, identical to R's grouping_vector():
spec.grouping()
# {'S1': 'subject:P1', 'S2': 'subject:P1', 'S3': 'subject:P2',
#  'S4': 'subject:P3', 'S5': 'subject:P3'}

df = spec.to_frame()  # pandas DataFrame of sample_data
```

# Verify the round-trip

Rather than take the comment above on faith, we can run the shipped Python
reader on the exact file we just wrote and compare what it recovers to R's
`grouping_vector()`. This is what `inst/python/conformance.py` does; the chunk
below invokes it through R and only runs when a `python3` interpreter is
present, so the vignette still builds without Python.

```{r conformance, eval = nzchar(Sys.which("python3")) && requireNamespace("jsonlite", quietly = TRUE)}
script <- system.file("python", "conformance.py", package = "splitGraph")
out_path <- tempfile(fileext = ".json")

# Run the Python reader on our JSON file; it writes back what it recovered.
status <- system2(
  "python3",
  c("-B", shQuote(script), shQuote(path), shQuote(out_path)),
  stdout = FALSE, stderr = FALSE
)

if (status == 0 && file.exists(out_path)) {
  recovered <- jsonlite::fromJSON(out_path)

  # Grouping recovered by Python:
  print(unlist(recovered$grouping))

  # Identical to the grouping R produced?
  r_grouping <- grouping_vector(constraint)
  cat("Python matches R exactly:",
      identical(unlist(recovered$grouping)[names(r_grouping)],
                r_grouping[names(r_grouping)]),
      "\n")
}
```

The same script also checks `order_rank`, and the package's test suite runs
this comparison as an automated conformance test (skipped when Python is
absent, and never on CRAN). The point is that the partition is *decided once*
in R and only *reproduced* elsewhere: the two languages cannot disagree.

# Drive scikit-learn

The grouping vector plugs directly into `GroupKFold` (or
`StratifiedGroupKFold`), guaranteeing that all samples from a subject land in
the same fold:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

groups = spec.groups()          # group_id per sample, in file order
X = np.zeros((len(groups), 1))  # placeholder design matrix

for train_idx, test_idx in GroupKFold(n_splits=3).split(X, groups=groups):
    train_groups = {groups[i] for i in train_idx}
    test_groups = {groups[i] for i in test_idx}
    assert train_groups.isdisjoint(test_groups)  # no subject leaks across
```

For an ordered evaluation (a `mode = "time"` spec), sort by `order_rank` first
and use `TimeSeriesSplit`:

```python
from sklearn.model_selection import TimeSeriesSplit

order = spec.ordered_index()  # row indices sorted by order_rank
df_ordered = spec.to_frame().iloc[order].reset_index(drop=True)

for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(df_ordered):
    ...
```

# Why this matters

The leakage-aware partition is decided *once*, in R, from explicit and
validated dependency structure, and every other language merely *reproduces*
it from the `split_spec`.
Nothing about the split logic is re-implemented in Python, so the two sides cannot drift. `split_spec` is the contract; scikit-learn (here) and `rsample` (on the R side) are just interchangeable consumers of it. That is what makes it an interchange format rather than internal plumbing for any one tool.
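
As a rough illustration of how little a consumer needs, the sketch below works from nothing but the sample-to-group mapping printed by `spec.grouping()` earlier in this vignette. It is plain standard-library Python, not part of `splitspec` or scikit-learn: the helpers `folds_from_grouping` and `leaked_groups` are hypothetical names introduced here, and the round-robin fold assignment is an assumption chosen for simplicity, not the strategy `GroupKFold` uses.

```python
from collections import defaultdict

# Grouping copied from the spec.grouping() output shown earlier.
grouping = {
    "S1": "subject:P1", "S2": "subject:P1", "S3": "subject:P2",
    "S4": "subject:P3", "S5": "subject:P3",
}


def folds_from_grouping(grouping, n_splits):
    """Assign whole groups to folds round-robin (hypothetical helper)."""
    group_ids = sorted(set(grouping.values()))
    if n_splits > len(group_ids):
        raise ValueError("n_splits cannot exceed the number of groups")
    fold_of_group = {g: i % n_splits for i, g in enumerate(group_ids)}
    folds = defaultdict(list)
    for sample_id, group_id in grouping.items():
        folds[fold_of_group[group_id]].append(sample_id)
    return [sorted(folds[i]) for i in range(n_splits)]


def leaked_groups(grouping, train_ids, test_ids):
    """Return the group ids present on both sides of a split."""
    train_groups = {grouping[s] for s in train_ids}
    test_groups = {grouping[s] for s in test_ids}
    return train_groups & test_groups


folds = folds_from_grouping(grouping, n_splits=3)
for test_ids in folds:
    train_ids = [s for s in grouping if s not in test_ids]
    # Each held-out fold must share no subject with its training set.
    assert not leaked_groups(grouping, train_ids, test_ids)
```

Because the helpers only ever move whole groups, the leakage check holds for any mapping, which is the sense in which a consumer reproduces the partition rather than re-deriving it.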