This vignette presents the mathematical formulation and
graph-theoretic foundations underlying corrselect.
Variable subset selection under correlation constraints is formulated as
a maximal independent set problem on threshold graphs, enabling exact
enumeration via established algorithms from computational graph theory.
The vignette defines the formal problem statement, explains the
graph-theoretic representation, details the three implemented algorithms
(Bron-Kerbosch, Eppstein-Löffler-Strash, and greedy heuristic), analyzes
their computational complexity, and provides comprehensive references to
the theoretical literature. For practical usage examples, see
vignette("quickstart") and
vignette("workflows"). For algorithmic control and
performance tuning, see vignette("advanced").
This section defines the core terms used throughout the documentation. All other vignettes refer back to these definitions.
A symmetric function
\(a: \mathcal{X} \times \mathcal{X} \to
\mathbb{R}\)
that quantifies the relationship between two variables.
Common cases:
All measures used in the package are normalized so that
\(|a_{ij}| \in [0,1]\).
A symmetric \(p \times p\) matrix
\(A\) whose entry \(a_{ij}\) is the association between
variables \(i\) and \(j\).
The diagonal satisfies \(a_{ii} =
1\).
For correlation-based analysis, \(A\)
typically comes from cor().
A user-defined cutoff in \((0,1)\).
Pairs with \(|a_{ij}| \ge \tau\) are
considered too strongly associated and cannot both appear in a valid
subset.
Common choices:
A subset \(S \subseteq
\{1,\dots,p\}\) satisfying
\(|a_{ij}| < \tau\) for all distinct
\(i, j \in S\).
All pairwise associations within \(S\)
remain below the threshold.
A valid subset that cannot be enlarged.
Formally, no variable \(v \notin S\)
satisfies \(|a_{vi}| < \tau\) for
all \(i \in S\).
(“Maximal” is not the same as “maximum”, which refers to the largest possible subset.)
An undirected graph \(G = (V, E)\) where:
Edges therefore connect compatible (low-association) variables.
A subset of vertices in which every pair is connected by an
edge.
In the threshold graph, cliques correspond to valid subsets.
A clique that cannot be extended by adding any additional
vertex.
Maximal cliques correspond exactly to maximal valid subsets.
force_in)A set \(F \subseteq V\) of variables
that must appear in all returned solutions.
Only maximal cliques containing all elements of \(F\) are considered.
A degeneracy-based algorithm for maximal clique enumeration.
Recommended when force_in is used.
Complexity: \(O(d \cdot 3^{d/3})\), where \(d\) is the graph’s degeneracy.
A classical backtracking algorithm for enumerating maximal cliques,
optionally with pivoting.
Used by default when force_in is not specified.
Worst-case complexity: \(O(3^{p/3})\).
A fast heuristic that constructs a single maximal clique via greedy
selection.
Runs in \(O(p^2)\).
Does not guarantee the largest possible subset.
Enumerates all maximal cliques using ELS or
Bron–Kerbosch.
Identifies the maximum (largest) valid subset.
Chooses the method automatically:
This balances optimality with computational cost.
Key Points — Terminology
- Association matrix: Symmetric matrix of pairwise relationships (correlations, Cramér’s V, etc.)
- Threshold (τ): Cutoff determining which pairs are “too associated” to coexist
- Valid subset: All pairwise associations below τ
- Maximal subset: Valid subset that cannot be enlarged
- Threshold graph: Graph where edges connect compatible (low-association) variable pairs
- Clique: Fully connected subgraph; maximal cliques = maximal valid subsets
Before diving into formal definitions, let’s build intuition with a simple conceptual overview.
Imagine you have a dataset with many predictors, some of which are highly correlated. For example:
When building statistical models, including highly correlated predictors creates problems:
The solution: remove redundant predictors while keeping as many variables as possible.
corrselect transforms this statistical problem into a graph problem:
Each maximal clique represents a valid subset: a group of variables where every pair has correlation below τ.
A maximal subset cannot be extended by adding more variables—it’s locally complete.
A maximum subset is the single largest possible subset—globally optimal.
corrselect finds all maximal subsets because:
Consider 4 variables with this correlation matrix:
# Construct a simple 4x4 correlation matrix
cor_4var <- matrix(c(
1.00, 0.85, 0.10, 0.15,
0.85, 1.00, 0.12, 0.18,
0.10, 0.12, 1.00, 0.75,
0.15, 0.18, 0.75, 1.00
), nrow = 4, byrow = TRUE)
colnames(cor_4var) <- rownames(cor_4var) <- paste0("V", 1:4)
# Display matrix
print(cor_4var)
#> V1 V2 V3 V4
#> V1 1.00 0.85 0.10 0.15
#> V2 0.85 1.00 0.12 0.18
#> V3 0.10 0.12 1.00 0.75
#> V4 0.15 0.18 0.75 1.00Observations:
Set threshold τ = 0.7. Which pairs violate the threshold?
Now we build the threshold graph where edges connect compatible variables (correlation < 0.7).
Text representation:
Variables: V1, V2, V3, V4
Edges (|correlation| < 0.7):
V1 —— V3 (cor = 0.10)
V1 —— V4 (cor = 0.15)
V2 —— V3 (cor = 0.12)
V2 —— V4 (cor = 0.18)
Missing edges (|correlation| ≥ 0.7):
V1 ⨯ V2 (cor = 0.85, too high)
V3 ⨯ V4 (cor = 0.75, too high)
Maximal cliques (maximal variable subsets):
{V1, V3}: Both connected, cannot add V2 or V4
{V1, V4}: Both connected, cannot add V2 or V3
{V2, V3}: Both connected, cannot add V1 or V4
{V2, V4}: Both connected, cannot add V1 or V3
Let’s verify this with code:
# Adjacency matrix for threshold graph (edges where |cor| < 0.7)
adj_matrix <- abs(cor_4var) < 0.7
diag(adj_matrix) <- FALSE # No self-loops
# Visualize as adjacency matrix
cat("Threshold graph edges (1 = edge exists):\n")
#> Threshold graph edges (1 = edge exists):
print(adj_matrix * 1)
#> V1 V2 V3 V4
#> V1 0 0 1 1
#> V2 0 0 1 1
#> V3 1 1 0 0
#> V4 1 1 0 0Interpretation: An edge exists between Vi and Vj if they can coexist in a valid subset.
Let’s visualize this threshold graph with nodes and edges:
# Node positions (arranged in a square for clarity)
node_pos <- matrix(c(
0, 1, # V1 (top-left)
2, 1, # V2 (top-right)
0, 0, # V3 (bottom-left)
2, 0 # V4 (bottom-right)
), ncol = 2, byrow = TRUE)
# Plot setup
par(mar = c(2, 2, 3, 2))
plot(node_pos, type = "n", xlim = c(-0.5, 2.5), ylim = c(-0.5, 1.5),
xlab = "", ylab = "", axes = FALSE,
main = "Threshold Graph (τ = 0.7)\nEdges connect variables with |correlation| < 0.7")
# Draw edges (where correlation < 0.7)
edge_color <- rgb(0.2, 0.5, 0.8, 0.6)
edge_lwd <- 2
for (i in 1:4) {
for (j in 1:4) {
if (i < j && adj_matrix[i, j]) {
segments(node_pos[i, 1], node_pos[i, 2],
node_pos[j, 1], node_pos[j, 2],
col = edge_color, lwd = edge_lwd)
}
}
}
# Draw nodes
node_size <- 0.15
for (i in 1:4) {
# Node circle
symbols(node_pos[i, 1], node_pos[i, 2],
circles = node_size, add = TRUE,
inches = FALSE, bg = "white", fg = "black", lwd = 2)
# Node label
text(node_pos[i, 1], node_pos[i, 2],
labels = paste0("V", i), cex = 1.2, font = 2)
}
# Add correlation annotations
text(1, 1.35, "cor = 0.85\n(too high, no edge)", cex = 0.8, col = "red")
text(1, -0.35, "cor = 0.75\n(too high, no edge)", cex = 0.8, col = "red")
# Legend showing maximal cliques
legend("right",
legend = c("Maximal cliques:", "{V1, V3}", "{V1, V4}", "{V2, V3}", "{V2, V4}"),
bty = "o", bg = "white", cex = 0.9,
pch = c(NA, 19, 19, 19, 19),
col = c(NA, "black", "black", "black", "black"))
# Add box around graph
box()Graph interpretation:
Maximal cliques (groups where everyone connects to everyone):
This visual representation makes the graph-theoretic formulation concrete: finding maximal valid variable subsets is equivalent to finding maximal cliques in the threshold graph.
For larger matrices, we can use igraph to visualize the threshold
graph structure. Let’s demonstrate with cor_example, which
has 20 variables with known block structure:
data(cor_example)
# Build threshold graph (edges where |correlation| < 0.7)
threshold <- 0.7
adj_mat <- abs(cor_example) < threshold
diag(adj_mat) <- FALSE
if (requireNamespace("igraph", quietly = TRUE)) {
library(igraph)
# Create graph from adjacency matrix
g <- graph_from_adjacency_matrix(adj_mat, mode = "undirected")
# Find maximal cliques
cliques <- max_cliques(g)
cat(sprintf("Found %d maximal cliques at threshold %.1f\n", length(cliques), threshold))
# Color nodes by which block they belong to
block_colors <- c(rep("#d73027", 5), # Block 1 (V1-V5): high correlation
rep("#fc8d59", 5), # Block 2 (V6-V10): moderate
rep("#91bfdb", 5), # Block 3 (V11-V15): low
rep("#4575b4", 5)) # Block 4 (V16-V20): minimal
# Plot network
par(mar = c(1, 1, 3, 1))
plot(g,
vertex.size = 10,
vertex.color = block_colors,
vertex.label.cex = 0.8,
vertex.label.color = "black",
edge.color = rgb(0.5, 0.5, 0.5, 0.3),
edge.width = 1,
layout = layout_with_fr(g),
main = sprintf("Threshold Graph (τ = %.1f): Variables with |cor| < %.1f are connected",
threshold, threshold))
# Add legend
legend("topleft",
legend = c("Block 1 (V1-V5): High cor",
"Block 2 (V6-V10): Moderate cor",
"Block 3 (V11-V15): Low cor",
"Block 4 (V16-V20): Minimal cor"),
fill = c("#d73027", "#fc8d59", "#91bfdb", "#4575b4"),
bty = "o", bg = "white", cex = 0.8)
} else {
cat("Install igraph for network visualization: install.packages('igraph')\n")
cat("Adjacency matrix (first 5×5 block):\n")
print(adj_mat[1:5, 1:5] * 1)
}
#>
#> Attaching package: 'igraph'
#> The following objects are masked from 'package:stats':
#>
#> decompose, spectrum
#> The following object is masked from 'package:base':
#>
#> union
#> Found 5 maximal cliques at threshold 0.7Interpretation:
A clique is a group where every pair is connected. In our graph:
Potential cliques:
Maximal cliques: Can any 2-variable clique be extended?
{V1, V3}
Add V2? No (V1–V2 not connected).
Add V4? No (V3–V4 not connected).
Maximal ✓
{V1, V4}
Add V2? No (V1–V2 not connected).
Add V3? No (V3–V4 not connected).
Maximal ✓
{V2, V3}
Add V1? No (V1–V2 not connected).
Add V4? No (V3–V4 not connected).
Maximal ✓
{V2, V4}
Add V1? No (V1–V2 not connected).
Add V3? No (V3–V4 not connected).
Maximal ✓
So there are 4 maximal cliques of size 2, each representing a valid variable subset.
Let corrselect confirm this:
results <- MatSelect(cor_4var, threshold = 0.7, method = "bron-kerbosch")
show(results)
#> CorrCombo object
#> -----------------
#> Method: bron-kerbosch
#> Threshold: 0.700
#> Subsets: 4 maximal subsets
#> Data Rows: 4 used in correlation
#> Pivot: TRUE
#>
#> Top combinations:
#> No. Variables Avg Max Size
#> ------------------------------------------------------------
#> [ 1] V1, V3 0.100 0.100 2
#> [ 2] V2, V3 0.120 0.120 2
#> [ 3] V1, V4 0.150 0.150 2
#> [ 4] V2, V4 0.180 0.180 2Result interpretation:
This toy example shows why corrselect enumerates all solutions:
Real datasets have similar structure but with more variables and more complex clustering.
Key Points — Intuitive Overview
- corrselect finds variable subsets where no pair exceeds a correlation threshold
- Multiple valid subsets typically exist; all are enumerated for user choice
- Graph representation: variables = nodes, low-correlation pairs = edges
- Maximal cliques in this graph = maximal valid subsets
- Comparing subsets reveals correlation structure (clusters of related variables)
The core problem is straightforward: given a set of \(p\) variables with known pairwise associations (correlations), identify all largest possible subsets where every pair of variables has an association below a user-defined threshold \(\tau\).
Think of this as a social network where variables are people and edges represent “get along well” (low correlation). We want to find all maximal groups of people where everyone gets along with everyone else in the group. A group is “maximal” if we cannot add any more people without introducing a conflict.
Some variables may be designated as “must include” (forced-in). In the social network analogy, these are VIPs who must be in every group, so we only search for groups containing all VIPs.
Input:
Constraints:
A subset \(S \subseteq \{1, \dots, p\}\) is valid if:
Objective:
Find all maximal valid subsets \(\mathcal{S} = \{S_1, \dots, S_m\}\) where \(S_k\) is maximal if no variable \(v \notin S_k\) satisfies \(|a_{vi}| < \tau\) for all \(i \in S_k\).
Output:
Collection \(\mathcal{S}\) containing all maximal valid subsets.
Given \(p\) variables, compute an association matrix \(A \in \mathbb{R}^{p \times p}\) where:
\[ a_{ij} = \text{association}(X_i, X_j) \]
For numeric variables, \(a_{ij}\) may be a correlation coefficient (Pearson, Spearman, Kendall, etc.). For mixed-type variables, \(a_{ij}\) is chosen based on variable types:
| Type \((X_i, X_j)\) | Measure |
|---|---|
| numeric, numeric | Pearson/Spearman/Kendall |
| numeric, factor | \(\eta^2\) |
| numeric, ordered | Spearman/Kendall |
| factor, factor | Cramér’s V |
| factor, ordered | Cramér’s V |
| ordered, ordered | Spearman/Kendall |
All measures are bounded: \(a_{ij} \in [0, 1]\) or \(a_{ij} \in [-1, 1]\).
Fix a threshold \(\tau \in (0, 1)\). A subset \(S \subseteq \{1, \dots, p\}\) is valid if:
\[ \forall i, j \in S,\ i \neq j: \quad |a_{ij}| < \tau \]
A valid subset \(S\) is maximal if no variable \(k \notin S\) satisfies:
\[ |a_{ki}| < \tau \quad \text{for all } i \in S \]
Key Points — Problem Formulation
- Input: p × p association matrix A, threshold τ ∈ (0,1)
- Valid subset: S where |a_ij| < τ for all pairs i,j ∈ S
- Maximal: cannot add any variable without violating threshold
- Goal: enumerate all maximal valid subsets
The variable selection problem has a natural graph representation that connects it to decades of research in computational graph theory. By viewing variables as nodes and “compatible pairs” (low correlation) as edges, we transform the statistical problem into a well-studied graph problem with proven algorithms.
This representation is powerful because:
The key insight: a group of mutually compatible variables (valid subset) is exactly a clique in a graph where edges represent compatibility.
Define the threshold graph \(G = (V, E)\) where:
Note the reversal: An edge \((i, j)\) means variables \(i\) and \(j\) have low correlation (can coexist). This is the complement of the typical “correlation graph” where edges represent high correlation.
A valid subset \(S\) corresponds to a clique in \(G\): all pairs in \(S\) are connected.
A maximal valid subset corresponds to a maximal clique: a clique that cannot be extended.
Finding all maximal valid subsets is equivalent to enumerating all maximal cliques in \(G\).
Consider 6 variables with correlation matrix:
# Create example correlation matrix
set.seed(123)
cor_6var <- matrix(c(
1.00, 0.85, 0.75, 0.20, 0.15, 0.10,
0.85, 1.00, 0.80, 0.25, 0.20, 0.15,
0.75, 0.80, 1.00, 0.30, 0.25, 0.20,
0.20, 0.25, 0.30, 1.00, 0.65, 0.55,
0.15, 0.20, 0.25, 0.65, 1.00, 0.60,
0.10, 0.15, 0.20, 0.55, 0.60, 1.00
), nrow = 6, byrow = TRUE)
rownames(cor_6var) <- colnames(cor_6var) <- paste0("V", 1:6)
# Display correlation matrix
print(round(cor_6var, 2))
#> V1 V2 V3 V4 V5 V6
#> V1 1.00 0.85 0.75 0.20 0.15 0.10
#> V2 0.85 1.00 0.80 0.25 0.20 0.15
#> V3 0.75 0.80 1.00 0.30 0.25 0.20
#> V4 0.20 0.25 0.30 1.00 0.65 0.55
#> V5 0.15 0.20 0.25 0.65 1.00 0.60
#> V6 0.10 0.15 0.20 0.55 0.60 1.00Threshold graph construction with \(\tau = 0.7\):
# Build adjacency matrix for threshold graph
tau <- 0.7
adj_matrix <- abs(cor_6var) < tau
diag(adj_matrix) <- FALSE
# Visualize correlation structure and threshold graph
par(mfrow = c(1, 2))
# Panel 1: Correlation heatmap
col_pal <- colorRampPalette(c("#3B4992", "white", "#EE0000"))(100)
image(1:6, 1:6, t(cor_6var[6:1, ]),
col = col_pal,
xlab = "", ylab = "",
main = "Correlation Matrix",
axes = FALSE,
zlim = c(-1, 1))
axis(1, at = 1:6, labels = colnames(cor_6var), cex.axis = 0.8)
axis(2, at = 6:1, labels = colnames(cor_6var), cex.axis = 0.8)
# Add correlation values
for (i in 1:6) {
for (j in 1:6) {
col_text <- if (abs(cor_6var[j, i]) > 0.5) "white" else "black"
text(i, 7 - j, sprintf("%.2f", cor_6var[j, i]),
cex = 0.7, col = col_text)
}
}
abline(h = 3.5, lwd = 2, lty = 2, col = "black")
abline(v = 3.5, lwd = 2, lty = 2, col = "black")
# Panel 2: Threshold graph (edges where |cor| < tau)
plot.new()
plot.window(xlim = c(0, 1), ylim = c(0, 1))
title(main = sprintf("Threshold Graph (τ = %.1f)", tau))
# Node positions (2 groups based on correlation structure)
pos <- matrix(c(
0.2, 0.8, # V1
0.3, 0.6, # V2
0.1, 0.4, # V3
0.7, 0.8, # V4
0.8, 0.5, # V5
0.9, 0.3 # V6
), ncol = 2, byrow = TRUE)
# Draw edges (compatible variable pairs)
for (i in 1:5) {
for (j in (i + 1):6) {
if (adj_matrix[i, j]) {
lines(c(pos[i, 1], pos[j, 1]), c(pos[i, 2], pos[j, 2]),
col = "gray70", lwd = 1.5)
}
}
}
# Draw nodes
points(pos[, 1], pos[, 2], pch = 21, cex = 4,
bg = c(rep(rgb(0.8, 0.2, 0.2, 0.7), 3),
rep(rgb(0.2, 0.5, 0.8, 0.7), 3)),
col = "black", lwd = 2)
# Add labels
text(pos[, 1], pos[, 2], labels = colnames(cor_6var),
col = "white", font = 2, cex = 0.8)
# Add legend
legend("bottomleft",
legend = c("Clique 1 (V1-V3)", "Clique 2 (V4-V6)", "Edge: |cor| < 0.7"),
pch = c(21, 21, NA),
pt.bg = c(rgb(0.8, 0.2, 0.2, 0.7), rgb(0.2, 0.5, 0.8, 0.7), NA),
pt.cex = 2,
lty = c(NA, NA, 1),
col = c("black", "black", "gray70"),
lwd = c(2, 2, 1.5),
bty = "o",
bg = "white",
cex = 0.7)Interpreting the visualization:
The left panel shows the correlation matrix with clear block structure. Variables V1-V3 are highly correlated with each other (correlations 0.75-0.85, shown in red), as are V4-V6 (correlations 0.55-0.65). Between-block correlations are low (0.10-0.30, shown in blue/white).
The right panel shows the corresponding threshold graph with \(\tau = 0.7\). An edge connects two variables if their absolute correlation is below 0.7 (compatible variables). Note that:
This graph structure immediately reveals the two maximal cliques: \(\{V1, V2, V3\}\) and \(\{V4, V5, V6\}\). No variable can be added to either clique without violating the threshold constraint.
Identify maximal cliques:
# Run MatSelect to find all maximal subsets
results <- MatSelect(cor_6var, threshold = 0.7, method = "els")
show(results)
#> CorrCombo object
#> -----------------
#> Method: els
#> Threshold: 0.700
#> Subsets: 3 maximal subsets
#> Data Rows: 6 used in correlation
#>
#> Top combinations:
#> No. Variables Avg Max Size
#> ------------------------------------------------------------
#> [ 1] V1, V4, V5, V6 0.375 0.650 4
#> [ 2] V2, V4, V5, V6 0.400 0.650 4
#> [ 3] V3, V4, V5, V6 0.425 0.650 4Interpreting the results:
MatSelect identified two maximal subsets of size 3, exactly matching the visual graph analysis. Both subsets satisfy \(|a_{ij}| < 0.7\) for all pairs within the subset.
Neither subset can be extended: adding any variable from the other group would introduce a pair with correlation above 0.7. For this symmetric example, both subsets have equal size and are equally valid solutions. In practice, you might choose based on domain knowledge (prefer variables with established theory) or downstream model performance.
Key Points — Graph-Theoretic Interpretation
- Build threshold graph: nodes = variables, edges connect pairs with |a_ij| < τ
- Maximal valid subsets ↔︎ maximal cliques in threshold graph
- Proven algorithms (ELS, Bron-Kerbosch) enumerate all maximal cliques efficiently
- Graph structure reveals variable clustering (densely connected = similar variables)
The mathematical concepts defined earlier map directly onto the function arguments and behavior of the package. The correspondence is outlined below.
threshold argumentControls which edges appear in the threshold graph.
corrPrune(data, threshold = 0.7) keeps an edge only if
\(|a_{ij}| < 0.7\)corrSelect() returns all maximal
cliques (full exact enumeration)corrPrune() returns one maximal clique
(greedy or exact, depending on mode)force_in argumentEnsures that certain variables appear in every returned subset.
corrPrune(data, threshold = 0.7, force_in = c("age", "gender"))mode and method
argumentsmode = "exact"mode = "greedy"mode = "auto"Choice of enumeration algorithm:
method = "els"force_in is
specifiedmethod = "bron-kerbosch"How the association structure enters the algorithm:
corrPrune(data)MatSelect(cor_matrix)assocSelect(data)Depends directly on the threshold:
These properties motivate the auto mode: exact for small
\(p\), greedy for larger \(p\).
Mathematical formulation:
\[ \text{Find all maximal } S \subseteq \{1,\dots,p\} \text{ such that } |a_{ij}| < 0.7\ \forall i, j \in S,\ \text{with } \{ \text{age} \} \subseteq S. \]
Implementation:
results <- corrSelect(
data = mydata,
threshold = 0.7, # τ = 0.7
force_in = "age", # F = {age}
mode = "exact", # enumerate all maximal cliques
method = "els" # use ELS algorithm (recommended with force_in)
)The returned object contains all maximal cliques, each representing a valid variable subset that satisfies the threshold constraint and includes the required variables.
Key Points — From Theory to Implementation
threshold= τ (correlation cutoff)force_in= F (variables required in all subsets)mode = "exact"→ enumerate all maximal cliquesmode = "greedy"→ fast heuristic, single subsetmethod = "els"recommended when usingforce_in
Two algorithms enumerate all maximal cliques exactly:
Uses degeneracy ordering to structure the search:
Formally, define:
\[ \text{extend}(R, P) = \begin{cases} \{R\}, & P = \emptyset \\ \bigcup_{v \in P} \text{extend}(R \cup \{v\}, P \cap N(v)), & \text{otherwise} \end{cases} \]
where \(N(v)\) denotes neighbors of \(v\) in \(G\).
Classical recursive backtracking with optional pivoting.
Let \(R\) = current clique, \(P\) = candidates, \(X\) = excluded nodes:
\[ \text{BK}(R, P, X) = \begin{cases} \text{report}(R), & P = X = \emptyset \\ \text{for each } v \in P \setminus N(u): \\ \quad \text{BK}(R \cup \{v\}, P \cap N(v), X \cap N(v)) \\ \quad P \leftarrow P \setminus \{v\}, \quad X \leftarrow X \cup \{v\} \end{cases} \]
Pivot \(u \in P \cup X\) reduces recursive calls.
Eppstein-Löffler-Strash (ELS) Algorithm:
Algorithm: ELS_MaxCliques(Graph G, ForceIn F)
Input: Threshold graph G = (V, E), forced variables F ⊆ V
Output: All maximal cliques containing F
1. Validate F forms a clique in G
2. Compute degeneracy ordering: v₁, v₂, ..., vₚ
3. Initialize results = []
4. For each vertex vᵢ in ordering:
a. If vᵢ ∉ F and vᵢ not adjacent to all in F:
Skip vᵢ (cannot extend F)
b. candidates = {vⱼ : j > i and vⱼ adjacent to vᵢ} ∩ N(F)
c. Extend(clique = {vᵢ} ∪ F, candidates, results)
5. Return results
Subroutine: Extend(R, P, results)
If P is empty:
Add R to results if maximal
Return
For each v in P:
neighbors = P ∩ N(v)
Extend(R ∪ {v}, neighbors, results)
Bron-Kerbosch Algorithm:
Algorithm: BronKerbosch(Graph G, use_pivot = TRUE)
Input: Threshold graph G = (V, E), pivot flag
Output: All maximal cliques
1. Initialize: R = ∅ (current clique)
P = V (candidates)
X = ∅ (excluded)
2. Call BK(R, P, X, results)
3. Return results
Subroutine: BK(R, P, X, results)
If P = ∅ and X = ∅:
Add R to results (R is maximal)
Return
If use_pivot:
Choose pivot u from P ∪ X with max |P ∩ N(u)|
iterate = P \ N(u) # Skip neighbors of pivot
Else:
iterate = P
For each vertex v in iterate:
BK(R ∪ {v},
P ∩ N(v), # Only candidates adjacent to v
X ∩ N(v), # Only excluded adjacent to v
results)
P = P \ {v} # Remove v from candidates
X = X ∪ {v} # Add v to excluded
Complexity:
Iteratively removes variables with highest average absolute association until all pairs satisfy \(|a_{ij}| < \tau\).
Returns a single valid subset (not necessarily maximal or optimal).
Complexity: \(O(p^2)\) vs \(O(2^p)\) for exact enumeration.
Key Points — Search Algorithms
- Exact mode: ELS or Bron-Kerbosch enumerate all maximal cliques
- ELS: O(d · 3^{d/3}), faster on sparse graphs, better with
force_in- Bron-Kerbosch: O(3^{p/3}), pivoting improves performance
- Greedy: O(p²), fast but returns single (possibly non-optimal) subset
- Rule of thumb: exact for p ≤ 20, greedy for p > 20
Input: Graph \(G = (V, E)\), forced-in set \(F\)
Output: All maximal cliques containing \(F\)
Algorithm ELS(G, F):
# Step 1: Compute degeneracy ordering
deg_order ← ComputeDegeneracyOrdering(G)
# Step 2: Initialize with forced-in variables
R ← F (current clique)
P ← V \ F (candidate vertices)
X ← ∅ (excluded vertices)
# Step 3: Validate forced-in set
for each i, j ∈ F:
if (i, j) ∉ E:
return ∅ (infeasible)
# Step 4: Recursively enumerate maximal cliques
return EnumerateCliques(R, P, X, deg_order)
Subroutine EnumerateCliques(R, P, X, ordering):
# Base case: no candidates, no exclusions → maximal clique
if P = ∅ and X = ∅:
report R as maximal clique
return
# Recursive case: extend with each candidate
for each v ∈ P (in degeneracy order):
# Extend clique
R' ← R ∪ {v}
# Update candidates: keep only neighbors of v
P' ← P ∩ N(v)
# Update exclusions: keep only neighbors of v
X' ← X ∩ N(v)
# Recurse
EnumerateCliques(R', P', X', ordering)
# Move v from candidates to exclusions
P ← P \ {v}
X ← X ∪ {v}
Subroutine ComputeDegeneracyOrdering(G):
# Degeneracy ordering: v_1, ..., v_p where each v_i has
# minimum degree in G[{v_i, ..., v_p}]
ordering ← [ ]
remaining ← V
while remaining ≠ ∅:
# Find vertex with minimum degree in induced subgraph
v ← argmin_{u ∈ remaining} |N(u) ∩ remaining|
# Add to ordering (reverse order)
ordering.prepend(v)
remaining ← remaining \ {v}
return ordering
Complexity: \(O(d \cdot 3^{d/3})\) where \(d\) is the degeneracy of \(G\).
Properties:
Input: Graph \(G = (V, E)\), forced-in set \(F\)
Output: All maximal cliques containing \(F\)
Algorithm BronKerbosch(G, F):
# Step 1: Initialize
R ← F (current clique)
P ← V \ F (candidate vertices)
X ← ∅ (excluded vertices)
# Step 2: Validate forced-in set
for each i, j ∈ F:
if (i, j) ∉ E:
return ∅ (infeasible)
# Step 3: Restrict candidates to neighbors of forced-in set
if F ≠ ∅:
P ← P ∩ (⋂_{v ∈ F} N(v))
# Step 4: Enumerate
return BK(R, P, X)
Subroutine BK(R, P, X):
# Base case: no candidates, no exclusions → maximal clique
if P = ∅ and X = ∅:
report R as maximal clique
return
# Choose pivot: vertex with most neighbors in P
u ← argmax_{v ∈ P ∪ X} |N(v) ∩ P|
# Iterate over non-neighbors of pivot (reduces recursion)
for each v ∈ P \ N(u):
# Extend clique
R' ← R ∪ {v}
# Update candidates: keep only neighbors of v
P' ← P ∩ N(v)
# Update exclusions: keep only neighbors of v
X' ← X ∩ N(v)
# Recurse
BK(R', P', X')
# Move v from candidates to exclusions
P ← P \ {v}
X ← X ∪ {v}
Complexity: \(O(3^{p/3})\) maximal cliques (worst-case).
Properties:
Pivot selection:
Input: Association matrix \(A\), threshold \(\tau\), forced-in set \(F\)
Output: Single valid subset (not necessarily maximal)
Algorithm GreedyPrune(A, τ, F):
# Step 1: Initialize with all variables
S ← {1, ..., p}
# Step 2: Validate forced-in set
if F ≠ ∅:
for each i, j ∈ F:
if |a_ij| ≥ τ:
return ∅ (infeasible)
# Step 3: Iteratively remove highest-correlated variables
while ∃ i, j ∈ S: |a_ij| ≥ τ:
# Compute average absolute correlation for each variable
for each v ∈ S \ F:
avg[v] ← (1 / |S| - 1) × Σ_{u ∈ S, u ≠ v} |a_vu|
# Remove variable with highest average correlation
v_max ← argmax_{v ∈ S \ F} avg[v]
S ← S \ {v_max}
# Step 4: Return pruned subset
return S
Complexity:
\(O(p^2 k)\), where \(k\) is the number of variables removed.
Properties:
Tie-breaking: When several variables share the same average absolute correlation:
Constraint: variables \(F \subseteq \{1, \dots, p\}\) must appear in all returned subsets.
Modify the search:
Formally, find maximal cliques in \(G\) containing \(F\).
Correlation: \(a_{ij} \in [-1, 1]\), use \(|a_{ij}|\) in threshold constraint
Association: \(a_{ij} \in [0, 1]\), use \(a_{ij}\) directly
Mixed-type data uses association matrix with measures chosen per variable-pair type.
Worst-case: \(O(3^{p/3})\) maximal cliques possible
Performance depends on graph density:
Time: \(O(p^2 k)\) where \(k\) = iterations
Space: \(O(p^2)\) for storing associations
Deterministic: same input produces same output
All selection functions return a CorrCombo S4 object
containing:
subset_list: list of character vectors (variable names
per subset)avg_corr: numeric vector (mean \(|a_{ij}|\) within each subset)min_corr: numeric vector (min \(|a_{ij}|\) within each subset)max_corr: numeric vector (max \(|a_{ij}|\) within each subset)threshold: value of \(\tau\)forced_in: forced variable namescor_method: correlation/association measure usedn_rows_used: sample size after removing missing
valuesResults are sorted by:
This section explains key design decisions underlying corrselect, addressing common questions about why certain choices were made.
Maximal: Cannot be extended further (locally optimal) Maximum: Largest possible size (globally optimal)
corrselect enumerates all maximal subsets rather than just finding the single maximum subset. Why?
Multiple equally good solutions: Real datasets often have many maximal subsets of equal or similar size. Returning only one discards valuable information about alternative variable combinations.
Domain knowledge integration: Users may prefer a slightly smaller subset containing specific variables over the globally largest subset. Having all options enables informed choice.
Sensitivity analysis: Comparing multiple maximal subsets reveals structural properties of the correlation matrix (e.g., tight vs loosely connected clusters).
Computational feasibility: Finding the maximum clique is NP-complete and often harder than enumerating all maximal cliques in practice. Modern maximal clique algorithms (ELS, Bron-Kerbosch) are highly efficient for typical correlation structures.
corrselect enforces a hard threshold: \(|a_{ij}| < \tau\) for all pairs. Alternative approaches use soft constraints (penalty functions, regularization). Why hard thresholds?
Interpretability: “No pair exceeds \(\tau\)” is a clear, verifiable guarantee. Soft constraints produce solutions where some pairs may exceed \(\tau\) with unknown magnitude.
Reproducibility: Hard thresholds produce deterministic results. Soft constraints often require tuning parameters (penalty weights, convergence criteria) that affect reproducibility.
Domain-specific requirements: Fields like ecological modeling have established thresholds (e.g., \(\tau = 0.7\) for WorldClim variables) based on empirical evidence. Hard thresholds directly implement these guidelines.
Exact enumeration: Hard constraints enable graph-theoretic formulation with exact algorithms. Soft constraints typically require heuristic optimization.
corrselect uses specialized graph algorithms (ELS, Bron-Kerbosch) rather than general optimization frameworks (integer programming, metaheuristics). Why?
Asymptotic efficiency: Graph algorithms exploit structural properties (sparsity, degeneracy) unavailable to generic solvers. For sparse graphs (low \(\tau\)), this yields orders-of-magnitude speedups.
Exact enumeration: Graph algorithms guarantee finding all maximal cliques. Optimization approaches typically find one solution.
Forced variables: Graph algorithms naturally handle forced-in constraints via initialization. Optimization approaches require additional constraints that may degrade performance.
Established theory: Maximal clique enumeration has 50+ years of algorithmic development with proven complexity bounds and implementation strategies.
corrselect considers only pairwise associations, not higher-order interactions (partial correlations, conditional independence). Why?
Computational tractability: Higher-order interactions require exponentially more computation. For \(p\) variables, pairwise methods scale as \(O(p^2)\), while \(k\)-way interactions scale as \(O(p^k)\).
Clear interpretation: “Variables \(i\) and \(j\) correlate at \(r = 0.85\)” is directly interpretable. Partial correlations require careful conditioning set selection and can be counterintuitive.
Robustness: Pairwise correlations are stable with moderate sample sizes. Partial correlations require \(n \gg p\) and are sensitive to model misspecification.
Method generality: Pairwise associations work for any variable type (numeric, categorical, mixed). Higher-order methods often require strong distributional assumptions.
When higher-order methods are appropriate: If your goal is causal discovery (identifying conditional independence structure), use methods like PC algorithm or constraint-based causal inference. corrselect focuses on reducing redundancy for predictive modeling and descriptive analysis.
Many correlation-pruning tools (e.g., caret::findCorrelation) return a single subset. corrselect can return all maximal subsets (exact mode) or one (greedy mode). Why offer exhaustive enumeration?
Alternative solutions: Multiple maximal subsets represent genuinely different variable combinations that all satisfy the threshold constraint. Arbitrarily choosing one discards information.
Downstream analysis: Different subsets may be preferred for different models or research questions. Enumerating all options enables post-hoc selection criteria.
Documentation: For reproducible research (especially JOSS/CRAN submissions), documenting all valid solutions provides transparency about algorithmic choices.
Computational cost: For typical use cases (\(p \leq 30\), \(\tau \geq 0.5\)), exhaustive enumeration completes in milliseconds. When infeasible, greedy mode provides a fast approximation.
When to use greedy mode: High-dimensional data (\(p > 50\)), dense correlation structure (high \(\tau\)), or when a single solution suffices (e.g., automated pipelines).
Key Points — Design Philosophy
- Maximal not maximum: All locally optimal subsets enumerated, not just the globally largest
- Hard threshold: Simple, interpretable guarantee vs. soft penalty optimization
- Graph algorithms: Leverage decades of research; proven correctness and complexity bounds
- Pairwise only: Computational tractability, clear interpretation, robustness
- Enumerate all: Preserves information for downstream choice; greedy available when speed critical
Maximal clique enumeration
force_in is specifiedGraph degeneracy:
Independent sets and vertex cover:
Variance inflation factor (VIF):
Correlation-based variable selection:
Numeric associations:
Categorical associations:
Mixed-type associations:
Bioclimatic modeling:
Genomics:
vignette("quickstart") - Interface overview and
examplesvignette("workflows") - Real-world workflow
examplesvignette("advanced") - Algorithmic control and custom
enginesvignette("comparison") - Comparison with
alternativessessionInfo()
#> R version 4.5.1 (2025-06-13 ucrt)
#> Platform: x86_64-w64-mingw32/x64
#> Running under: Windows 11 x64 (build 26200)
#>
#> Matrix products: default
#> LAPACK version 3.12.1
#>
#> locale:
#> [1] LC_COLLATE=C
#> [2] LC_CTYPE=English_United States.utf8
#> [3] LC_MONETARY=English_United States.utf8
#> [4] LC_NUMERIC=C
#> [5] LC_TIME=English_United States.utf8
#>
#> time zone: Europe/Luxembourg
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] igraph_2.1.4 car_3.1-3 carData_3.0-5
#> [4] microbenchmark_1.5.0 corrselect_3.0.2
#>
#> loaded via a namespace (and not attached):
#> [1] tidyselect_1.2.1 timeDate_4051.111 dplyr_1.1.4
#> [4] farver_2.1.2 S7_0.2.0 fastmap_1.2.0
#> [7] pROC_1.19.0.1 caret_7.0-1 digest_0.6.37
#> [10] rpart_4.1.24 timechange_0.3.0 lifecycle_1.0.4
#> [13] survival_3.8-3 magrittr_2.0.4 compiler_4.5.1
#> [16] rlang_1.1.6 sass_0.4.10 tools_4.5.1
#> [19] yaml_2.3.10 Boruta_9.0.0 data.table_1.17.8
#> [22] knitr_1.50 plyr_1.8.9 RColorBrewer_1.1-3
#> [25] abind_1.4-8 withr_3.0.2 purrr_1.2.0
#> [28] nnet_7.3-20 grid_4.5.1 stats4_4.5.1
#> [31] future_1.67.0 ggplot2_4.0.0 globals_0.18.0
#> [34] scales_1.4.0 iterators_1.0.14 MASS_7.3-65
#> [37] cli_3.6.5 rmarkdown_2.30 generics_0.1.4
#> [40] rstudioapi_0.17.1 future.apply_1.20.0 reshape2_1.4.4
#> [43] cachem_1.1.0 stringr_1.5.2 splines_4.5.1
#> [46] parallel_4.5.1 vctrs_0.6.5 hardhat_1.4.2
#> [49] glmnet_4.1-10 Matrix_1.7-4 jsonlite_2.0.0
#> [52] Formula_1.2-5 listenv_0.9.1 systemfonts_1.3.1
#> [55] foreach_1.5.2 gower_1.0.2 jquerylib_0.1.4
#> [58] recipes_1.3.1 glue_1.8.0 parallelly_1.45.1
#> [61] codetools_0.2-20 lubridate_1.9.4 stringi_1.8.7
#> [64] gtable_0.3.6 shape_1.4.6.1 tibble_3.3.0
#> [67] pillar_1.11.1 htmltools_0.5.8.1 ipred_0.9-15
#> [70] lava_1.8.2 R6_2.6.1 textshaping_1.0.3
#> [73] evaluate_1.0.5 lattice_0.22-7 bslib_0.9.0
#> [76] class_7.3-23 Rcpp_1.1.0 svglite_2.2.2
#> [79] nlme_3.1-168 prodlim_2025.04.28 ranger_0.17.0
#> [82] xfun_0.53 ModelMetrics_1.2.2.2 pkgconfig_2.0.3