---
title: "Cross-language handoff: R to JSON to Python to scikit-learn"
author: "Selçuk Korkmaz"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    toc: true
vignette: >
  %\VignetteIndexEntry{Cross-language handoff: R to JSON to Python to scikit-learn}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  message = FALSE,
  warning = FALSE
)
```

`split_spec` is designed as an *interchange format*, not as internal plumbing
for any one downstream package. This vignette shows the full path: derive a
constraint in R, serialize it to JSON, read it in Python with the shipped
`splitspec` reference consumer, and hand the recovered grouping straight to a
scikit-learn resampler.

The Python chunks below are shown but not executed, so building the vignette
needs no Python. To keep the central claim honest rather than asserted, the
vignette *does* run the shipped Python reader through R when a `python3`
interpreter is available (see "Verify the round-trip"), and shows that the
grouping it recovers matches R's exactly.

# Derive and serialize in R

```{r derive}
library(splitGraph)

meta <- data.frame(
  sample_id    = c("S1", "S2", "S3", "S4", "S5"),
  subject_id   = c("P1", "P1", "P2", "P3", "P3"),
  timepoint_id = c("T0", "T1", "T0", "T2", "T0"),
  time_index   = c(0, 1, 0, 2, 0),
  stringsAsFactors = FALSE
)

g <- graph_from_metadata(meta, graph_name = "handoff-demo")

# Group so that repeated measures of the same subject never straddle a split.
constraint <- derive_split_constraints(g, mode = "subject")
spec <- as_split_spec(constraint, graph = g)

path <- tempfile(fileext = ".json")
write_split_spec(spec, path)
```

The written file carries a `$schema` reference and a `schema_version`, and can
be validated against the shipped JSON Schema before it ever leaves R:

```{r validate}
report <- validate_split_spec_json(path)
report$valid

# The R-side grouping we expect Python to reproduce:
grouping_vector(constraint)
```
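
On the consuming side, a cheap pre-check before any full schema validation is to confirm that the two header fields named above are present. The following is a minimal sketch using only the Python standard library. The helper name `missing_header_fields`, the example document, and the placeholder `$schema` URL are illustrative assumptions, not part of `splitspec`, and the check is no substitute for `validate_split_spec_json()`.

```python
import json

REQUIRED_HEADER_FIELDS = ("$schema", "schema_version")


def missing_header_fields(doc):
    """Return the required header fields that are absent from a parsed spec."""
    return [field for field in REQUIRED_HEADER_FIELDS if field not in doc]


# Hypothetical stand-in for a written spec; real files carry more fields.
example_text = json.dumps({
    "$schema": "https://example.org/split_spec.schema.json",  # placeholder URL
    "schema_version": "0.2.0",
})
example_doc = json.loads(example_text)

print(missing_header_fields(example_doc))                   # nothing missing
print(missing_header_fields({"schema_version": "0.2.0"}))   # "$schema" missing
```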

# Read in Python

The reference consumer lives in the installed package under `inst/python`. On
the R side its location is:

```{r pypath}
system.file("python", package = "splitGraph")
```

Point Python at that directory (or install/copy the `splitspec` package), then:

```python
import sys
# sys.path.append(<the inst/python path printed above>)
from splitspec import load_split_spec

# Path to the JSON file that write_split_spec() produced on the R side.
spec = load_split_spec("split_spec.json")

spec.schema_version        # "0.2.0"
spec.constraint_mode       # "subject"
spec.recommended_resampling  # "grouped_cv"

# Grouping keyed by sample_id — identical to R's grouping_vector():
spec.grouping()
# {'S1': 'subject:P1', 'S2': 'subject:P1', 'S3': 'subject:P2',
#  'S4': 'subject:P3', 'S5': 'subject:P3'}

df = spec.to_frame()       # pandas DataFrame of sample_data
```
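
Because the grouping is keyed by `sample_id`, it can be aligned to whatever row order a feature table happens to use. A minimal sketch, assuming a plain dict shaped like the `spec.grouping()` output shown above; the helper `align_groups` and the `feature_rows` order are hypothetical and not part of `splitspec`:

```python
grouping = {
    "S1": "subject:P1", "S2": "subject:P1", "S3": "subject:P2",
    "S4": "subject:P3", "S5": "subject:P3",
}


def align_groups(grouping, sample_ids):
    """Return one group label per sample_id, failing loudly on unknown ids."""
    unknown = [s for s in sample_ids if s not in grouping]
    if unknown:
        raise KeyError(f"sample_ids missing from the spec: {unknown}")
    return [grouping[s] for s in sample_ids]


# Hypothetical row order of a feature table that differs from file order.
feature_rows = ["S3", "S1", "S5", "S2", "S4"]
groups_aligned = align_groups(grouping, feature_rows)
print(groups_aligned)
```

Raising on unknown ids, rather than silently dropping them, keeps a mismatch between the feature table and the spec from turning into an unnoticed change in the folds.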

# Verify the round-trip

Rather than take the comment above on faith, we can run the shipped Python
reader on the exact file we just wrote and compare what it recovers to R's
`grouping_vector()`. This is what `inst/python/conformance.py` does; the chunk
below invokes it through R and only runs when a `python3` interpreter is
present, so the vignette still builds without Python.

```{r conformance, eval = nzchar(Sys.which("python3")) && requireNamespace("jsonlite", quietly = TRUE)}
script   <- system.file("python", "conformance.py", package = "splitGraph")
out_path <- tempfile(fileext = ".json")

# Run the Python reader on our JSON file; it writes back what it recovered.
status <- system2(
  "python3", c("-B", shQuote(script), shQuote(path), shQuote(out_path)),
  stdout = FALSE, stderr = FALSE
)

if (status == 0 && file.exists(out_path)) {
  recovered <- jsonlite::fromJSON(out_path)

  # Grouping recovered by Python:
  print(unlist(recovered$grouping))

  # Identical to the grouping R produced?
  r_grouping <- grouping_vector(constraint)
  cat("Python matches R exactly:",
      identical(unlist(recovered$grouping)[names(r_grouping)],
                r_grouping[names(r_grouping)]), "\n")
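} else {
  # Report a failed run instead of silently printing nothing.
  cat("Conformance script did not complete; comparison skipped.\n")
}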
```

The same script also checks `order_rank`, and the package's test suite runs this
comparison as an automated conformance test (skipped when Python is absent, and
never on CRAN). The point is that the partition is *decided once* in R and only
*reproduced* elsewhere, so the two languages have no split logic of their own to
disagree about.
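
The comparison in the chunk above reindexes Python's result by R's names before calling `identical()`, so key order does not matter, only the sample-to-group mapping. The same idea can be sketched from the Python side. The function name `grouping_mismatches` and the two input dicts are illustrative assumptions, and this is not a copy of what `conformance.py` contains.

```python
def grouping_mismatches(expected, recovered):
    """Map each sample_id whose group differs to (expected, recovered)."""
    diffs = {}
    for sample_id in sorted(set(expected) | set(recovered)):
        pair = (expected.get(sample_id), recovered.get(sample_id))
        if pair[0] != pair[1]:
            diffs[sample_id] = pair
    return diffs


r_side = {
    "S1": "subject:P1", "S2": "subject:P1", "S3": "subject:P2",
    "S4": "subject:P3", "S5": "subject:P3",
}
# Same mapping with the keys in reverse order, as a reader might return it.
py_side = dict(reversed(list(r_side.items())))

print(grouping_mismatches(r_side, py_side))  # empty dict: the mappings agree

# A single reassigned sample is reported with both labels.
drifted = dict(py_side, S3="subject:P1")
print(grouping_mismatches(r_side, drifted))
```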

# Drive scikit-learn

The grouping vector plugs directly into `GroupKFold` (or
`StratifiedGroupKFold`), guaranteeing that all samples from a subject land in
the same fold:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

groups = spec.groups()          # group_id per sample, in file order
X = np.zeros((len(groups), 1))  # placeholder design matrix

for train_idx, test_idx in GroupKFold(n_splits=3).split(X, groups=groups):
    train_groups = {groups[i] for i in train_idx}
    test_groups  = {groups[i] for i in test_idx}
    assert train_groups.isdisjoint(test_groups)  # no subject leaks across
```
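
The assertion inside the loop can be factored into a reusable check that works with any splitter yielding `(train_idx, test_idx)` pairs. A minimal sketch with hand-written folds, so it runs without scikit-learn; the helper name `leaked_groups` and the example folds are assumptions for illustration:

```python
def leaked_groups(groups, train_idx, test_idx):
    """Return the group labels that appear on both sides of a split."""
    train = {groups[i] for i in train_idx}
    test = {groups[i] for i in test_idx}
    return train & test


# Group labels in file order, matching the grouping shown earlier.
groups = ["subject:P1", "subject:P1", "subject:P2", "subject:P3", "subject:P3"]

# A group-respecting split: both P3 samples are held out together.
print(leaked_groups(groups, train_idx=[0, 1, 2], test_idx=[3, 4]))

# A split by row position separates S1 from S2 and S4 from S5,
# so both P1 and P3 appear on each side.
print(leaked_groups(groups, train_idx=[0, 2, 3], test_idx=[1, 4]))
```

Returning the offending labels, rather than a bare boolean, makes it easier to see which subjects a faulty split would leak.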

For an ordered evaluation (a `mode = "time"` spec), sort by `order_rank` first
and use `TimeSeriesSplit`:

```python
from sklearn.model_selection import TimeSeriesSplit

order = spec.ordered_index()    # row indices sorted by order_rank
df_ordered = spec.to_frame().iloc[order].reset_index(drop=True)

for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(df_ordered):
    ...
```
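
To make the ordering step concrete, the sketch below sorts row positions by rank and then builds expanding-window splits in which every training row precedes every test row. It is a hand-rolled illustration using only the standard library, not scikit-learn's implementation and not the code behind `ordered_index()`; the `order_rank` values and the helper `expanding_splits` are hypothetical.

```python
# Hypothetical order_rank values, one per row in file order.
order_rank = [2, 4, 1, 5, 3]

# Row indices sorted by rank (sorted() is stable, so ties would keep file order).
order = sorted(range(len(order_rank)), key=lambda i: order_rank[i])
print(order)  # [2, 0, 4, 1, 3]


def expanding_splits(n_rows, n_splits):
    """Yield (train, test) position lists where train always precedes test."""
    test_size = n_rows // (n_splits + 1)
    if test_size == 0:
        raise ValueError("too few rows for the requested number of splits")
    first_test = n_rows - n_splits * test_size
    for start in range(first_test, n_rows, test_size):
        yield list(range(start)), list(range(start, start + test_size))


# Positions refer to the re-ordered rows, so ranks must increase across a split.
ranks_sorted = [order_rank[i] for i in order]
for train_pos, test_pos in expanding_splits(len(order), n_splits=3):
    latest_train = max(ranks_sorted[p] for p in train_pos)
    earliest_test = min(ranks_sorted[p] for p in test_pos)
    assert latest_train < earliest_test  # no future rows in training
```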

# Why this matters

The leakage-aware partition is decided *once*, in R, from explicit and validated
dependency structure — and every other language merely *reproduces* it from the
`split_spec`. Nothing about the split logic is re-implemented in Python, so the
two sides cannot drift. `split_spec` is the contract; scikit-learn (here) and
`rsample` (on the R side) are just interchangeable consumers of it. That is what
makes it an interchange format rather than internal plumbing for any one tool.
