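Beyond the classic subject / batch / study / time relations,
splitGraph models several further leakage axes, in two
families: cluster relations (site, region, platform, assay), which group
samples by a shared label in the same way as subject or
batch, and pairwise relations (kinship, spatial proximity), where an edge
is kept only when it passes a threshold. This vignette builds and groups by each, and shows how the threshold drives the pairwise grouping.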
graph_from_metadata() auto-detects site_id,
region_id, platform_id, and
assay_id columns and builds the corresponding typed nodes
and edges. Each then has its own constraint mode. The example below uses
site, platform, and assay; region behaves identically (a
region_id column and mode = "region") and is
omitted only to keep the output short.
meta <- data.frame(
  sample_id = paste0("S", 1:6),
  subject_id = c("P1", "P1", "P2", "P2", "P3", "P3"),
  site_id = c("NYC", "NYC", "BOS", "BOS", "NYC", "BOS"),
  platform_id = c("illumina", "illumina", "nanopore", "nanopore", "illumina", "nanopore"),
  assay_id = c("rnaseq", "rnaseq", "rnaseq", "wgs", "wgs", "wgs"),
  stringsAsFactors = FALSE
)
g <- graph_from_metadata(meta, graph_name = "structure-demo")
grouping_vector(derive_split_constraints(g, mode = "site"))
#>         S1         S2         S3         S4         S5         S6
#> "site:NYC" "site:NYC" "site:BOS" "site:BOS" "site:NYC" "site:BOS"
grouping_vector(derive_split_constraints(g, mode = "platform"))
#>                  S1                  S2                  S3                  S4
#> "platform:illumina" "platform:illumina" "platform:nanopore" "platform:nanopore"
#>                  S5                  S6
#> "platform:illumina" "platform:nanopore"
grouping_vector(derive_split_constraints(g, mode = "assay"))
#>             S1             S2             S3             S4             S5
#> "assay:rnaseq" "assay:rnaseq" "assay:rnaseq"    "assay:wgs"    "assay:wgs"
#>             S6
#>    "assay:wgs"

Whatever mode is primary, every detected cluster relation is also
carried into the split_spec as a blocking
annotation, so a downstream consumer can block on site, platform,
or assay even when the split unit is something else — here, subject:
spec <- as_split_spec(derive_split_constraints(g, mode = "subject"), graph = g)
spec$block_vars
#> [1] "site_group" "platform_group" "assay_group"
head(spec$sample_data[, c("sample_id", "group_id",
                          "site_group", "platform_group", "assay_group")])
#>   sample_id   group_id site_group platform_group assay_group
#> 1        S1 subject:P1        NYC       illumina      rnaseq
#> 2        S2 subject:P1        NYC       illumina      rnaseq
#> 3        S3 subject:P2        BOS       nanopore      rnaseq
#> 4        S4 subject:P2        BOS       nanopore         wgs
#> 5        S5 subject:P3        NYC       illumina         wgs
#> 6        S6 subject:P3        BOS       nanopore         wgs

Any of these relations can also participate in a composite derivation, where several dependency sources are combined and each connected component becomes one group.
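The connected-component idea behind a composite derivation can be sketched in base R, without the package. This is an illustration only, not splitGraph's implementation: `composite_groups()` is a hypothetical helper, and `demo` simply repeats the subject and site columns of `meta`. It also shows why composite groups deserve a look before use. Subjects alone give three groups, but subject plus site links every sample into a single component, because P3 has one sample at each site.

```r
# Same subject and site labels as in `meta`.
demo <- data.frame(
  sample_id  = paste0("S", 1:6),
  subject_id = c("P1", "P1", "P2", "P2", "P3", "P3"),
  site_id    = c("NYC", "NYC", "BOS", "BOS", "NYC", "BOS"),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of splitGraph): samples that share a value in
# any of `cols` are linked, and each connected component becomes one group.
composite_groups <- function(data, cols) {
  label <- seq_len(nrow(data))
  repeat {
    new <- label
    # Within each column, every sample takes the smallest label in its cluster.
    for (col in cols) new <- ave(new, data[[col]], FUN = min)
    if (all(new == label)) break  # labels stopped changing: components found
    label <- new
  }
  setNames(match(label, unique(label)), data$sample_id)
}

subject_only <- composite_groups(demo, "subject_id")
subject_site <- composite_groups(demo, c("subject_id", "site_id"))

length(unique(subject_only))  # 3 groups: one per subject
length(unique(subject_site))  # 1 group: site links the subjects together
```

When a composite derivation collapses to very few groups like this, a narrower primary mode with the other relations carried as blocking annotations may be the more useful choice.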
Spatial proximity works the same way over sample coordinates — for
example spot locations from spatial transcriptomics, positions on a
tissue slide, or geographic site coordinates.
spatial_edges_from_coords() connects samples within a
radius (Euclidean distance over the coordinate columns), and
mode = "spatial" groups the resulting connected
components.
# Two spatial clusters. Cluster 1 (S1-S3) is a chain: neighbouring pairs are
# within the radius, but the endpoints are not.
coords <- data.frame(
  sample_id = paste0("S", 1:6),
  x = c(0, 1, 2, 6.0, 6.9, 6.2),
  y = c(0, 1, 0, 6.0, 6.6, 5.3),
  stringsAsFactors = FALSE
)
adj_edges <- spatial_edges_from_coords(coords, radius = 1.5)
meta_s <- data.frame(
  sample_id = paste0("S", 1:6),
  subject_id = paste0("P", 1:6),
  stringsAsFactors = FALSE
)
samples_s <- create_nodes(meta_s, "Sample", "sample_id")
subjects_s <- create_nodes(meta_s, "Subject", "subject_id")
belongs_s <- create_edges(meta_s, "sample_id", "subject_id",
                          "Sample", "Subject", "sample_belongs_to_subject")
g_sp <- build_dependency_graph(list(samples_s, subjects_s), list(belongs_s, adj_edges))
sp_groups <- grouping_vector(derive_split_constraints(g_sp, mode = "spatial"))
sp_groups
#>                    S1                    S2                    S3
#> "spatial:component_1" "spatial:component_1" "spatial:component_1"
#>                    S4                    S5                    S6
#> "spatial:component_2" "spatial:component_2" "spatial:component_2"

Plotting the coordinates, drawing the within-radius adjacency edges
in grey, and colouring points by the derived group makes the transitive
closure concrete: S1–S2 and S2–S3 are each within the 1.5
radius, so all three share a group even though S1 and S3 are
2 units apart and were never linked directly. Every sample
in the second cluster is likewise reachable from the others, while the
two clusters are far enough apart to stay separate:
sp_grp <- factor(sp_groups[coords$sample_id])
row_of <- setNames(seq_len(nrow(coords)), coords$sample_id)
from_i <- row_of[sub("^sample:", "", adj_edges$data$from)]
to_i <- row_of[sub("^sample:", "", adj_edges$data$to)]
palette_sp <- c("#4C78A8", "#F58518")
plot(coords$x, coords$y, type = "n", asp = 1, xlab = "x", ylab = "y",
     main = "Spatial groups (radius = 1.5)")
segments(coords$x[from_i], coords$y[from_i],
         coords$x[to_i], coords$y[to_i], col = "grey60", lwd = 2)
points(coords$x, coords$y, pch = 19, cex = 3.5, col = palette_sp[as.integer(sp_grp)])
text(coords$x, coords$y, labels = coords$sample_id, col = "white", cex = 0.8, font = 2)
legend("topleft", legend = levels(sp_grp), pch = 19,
       col = palette_sp[seq_along(levels(sp_grp))], title = "Spatial group", bty = "n")

Real splits are derived on a subset of samples (the
training rows, say). For pairwise (and composite) modes this raises a
subtle question: if a sample that bridges two others is left
out of the subset, could those two still inherit a shared group from the
full graph? They do not. When you pass samples =, grouping
is recomputed within that subset, so structure that exists only through
an excluded sample never leaks across the split.
The spatial chain makes this visible. S1 and S3 shared a group only because S2 bridged them; ask for S1 and S3 alone through samples =, and they correctly fall into separate groups.
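The effect of recomputing within the subset can be reproduced in base R on the same coordinates. The sketch below is an illustration under stated assumptions rather than the package's code: `radius_groups()` is a hypothetical helper, it treats "within the radius" as distance `<=` radius, and it drops excluded samples before any edge is formed.

```r
# Same coordinates as the spatial example.
pts <- data.frame(
  sample_id = paste0("S", 1:6),
  x = c(0, 1, 2, 6.0, 6.9, 6.2),
  y = c(0, 1, 0, 6.0, 6.6, 5.3),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of splitGraph): restrict to `samples` first,
# then link points within `radius` and label the connected components.
radius_groups <- function(pts, radius, samples = pts$sample_id) {
  kept <- pts[pts$sample_id %in% samples, ]
  # Assumption: inclusive threshold on Euclidean distance.
  adj <- as.matrix(dist(kept[, c("x", "y")])) <= radius
  label <- seq_len(nrow(kept))
  repeat {
    # Each point takes the smallest label among itself and its neighbours.
    new <- vapply(seq_along(label), function(i) min(label[adj[i, ]]), integer(1))
    if (all(new == label)) break
    label <- new
  }
  setNames(match(label, unique(label)), kept$sample_id)
}

full <- radius_groups(pts, radius = 1.5)
pair <- radius_groups(pts, radius = 1.5, samples = c("S1", "S3"))

full[c("S1", "S3")]  # same group: S2 bridges them in the full set
pair                 # different groups: no bridge exists in the subset
```

Filtering before the components are computed is what matters here. Labelling the full graph first and subsetting the labels afterwards would keep S1 and S3 together.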
Because the threshold (kinship cutoff, spatial radius) is applied up
front in the edge-building helpers, it is a derivation input,
not a modeling choice: splitGraph forms groups over
whatever edges survive and never computes folds itself. The resulting
split_spec is handed to a downstream consumer for
execution, exactly as with every other mode — see the
adapter-cookbook and cross-language-handoff vignettes
for that step.
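To see how strongly the threshold shapes the groups, the same six coordinates can be regrouped at several radii in base R. As before, this is a sketch and not splitGraph's implementation: `n_spatial_groups()` is a hypothetical helper, and the inclusive `<=` comparison is an assumption about how "within a radius" is evaluated.

```r
# Same coordinates as the spatial example.
pts <- data.frame(
  sample_id = paste0("S", 1:6),
  x = c(0, 1, 2, 6.0, 6.9, 6.2),
  y = c(0, 1, 0, 6.0, 6.6, 5.3),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of splitGraph): number of connected components
# when points at Euclidean distance <= radius are linked.
n_spatial_groups <- function(pts, radius) {
  adj <- as.matrix(dist(pts[, c("x", "y")])) <= radius
  label <- seq_len(nrow(pts))
  repeat {
    # Propagate the smallest label across each point's neighbourhood.
    new <- vapply(seq_along(label), function(i) min(label[adj[i, ]]), integer(1))
    if (all(new == label)) break
    label <- new
  }
  length(unique(label))
}

radii  <- c(1.0, 1.5, 7.0)
counts <- vapply(radii, function(r) n_spatial_groups(pts, r), integer(1))
setNames(counts, radii)  # 5, 2 and 1 groups respectively
```

At radius 1.0 only S4 and S6 are close enough to link, so there are five groups. At 1.5 the two clusters appear, and at 7.0 the gap between them closes and everything becomes one group. The radius is therefore best fixed from domain knowledge before any split is derived.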