Theory and Formulation

Gilles Colling

2025-11-28

Overview

This vignette presents the mathematical formulation and graph-theoretic foundations underlying corrselect. Variable subset selection under correlation constraints is formulated as a maximal independent set problem on threshold graphs, enabling exact enumeration via established algorithms from computational graph theory. The vignette defines the formal problem statement, explains the graph-theoretic representation, details the three implemented algorithms (Bron-Kerbosch, Eppstein-Löffler-Strash, and greedy heuristic), analyzes their computational complexity, and provides comprehensive references to the theoretical literature. For practical usage examples, see vignette("quickstart") and vignette("workflows"). For algorithmic control and performance tuning, see vignette("advanced").

Contents

  1. Terminology: Core definitions (association, threshold, valid subset, clique)
  2. Intuitive Overview: Conceptual introduction with toy examples
  3. Problem Formulation: Formal mathematical statement
  4. Graph-Theoretic Interpretation: Threshold graphs and maximal cliques
  5. From Theory to Implementation: How concepts map to function arguments
  6. Search Algorithms: Exact enumeration vs greedy heuristic
  7. Algorithm Pseudocode: ELS, Bron-Kerbosch, Greedy
  8. Technical Details: Forced variables, complexity, output structure
  9. Design Philosophy: Why maximal? Why hard threshold? Why graphs?
  10. References: Academic literature and further reading

Terminology

This section defines the core terms used throughout the documentation. All other vignettes refer back to these definitions.

Association measure

A symmetric function
\(a: \mathcal{X} \times \mathcal{X} \to \mathbb{R}\)
that quantifies the relationship between two variables.

Common cases:

All measures used in the package are normalized so that
\(|a_{ij}| \in [0,1]\).


Association matrix

A symmetric \(p \times p\) matrix \(A\) whose entry \(a_{ij}\) is the association between variables \(i\) and \(j\).
The diagonal satisfies \(a_{ii} = 1\).
For correlation-based analysis, \(A\) typically comes from cor().


Threshold (\(\tau\))

A user-defined cutoff in \((0,1)\).
Pairs with \(|a_{ij}| \ge \tau\) are considered too strongly associated and cannot both appear in a valid subset.

Common choices:


Valid subset

A subset \(S \subseteq \{1,\dots,p\}\) satisfying
\(|a_{ij}| < \tau\) for all distinct \(i, j \in S\).
All pairwise associations within \(S\) remain below the threshold.


Maximal valid subset

A valid subset that cannot be enlarged.
Formally, no variable \(v \notin S\) satisfies \(|a_{vi}| < \tau\) for all \(i \in S\).

(“Maximal” is not the same as “maximum”, which refers to the largest possible subset.)


Threshold graph

An undirected graph \(G = (V, E)\) where:

Edges therefore connect compatible (low-association) variables.


Clique

A subset of vertices in which every pair is connected by an edge.
In the threshold graph, cliques correspond to valid subsets.


Maximal clique

A clique that cannot be extended by adding any additional vertex.
Maximal cliques correspond exactly to maximal valid subsets.


Forced-in variables (force_in)

A set \(F \subseteq V\) of variables that must appear in all returned solutions.
Only maximal cliques containing all elements of \(F\) are considered.


ELS (Eppstein–Löffler–Strash)

A degeneracy-based algorithm for maximal clique enumeration.
Recommended when force_in is used.

Complexity: \(O(d \cdot 3^{d/3})\), where \(d\) is the graph’s degeneracy.


Bron–Kerbosch

A classical backtracking algorithm for enumerating maximal cliques, optionally with pivoting.
Used by default when force_in is not specified.

Worst-case complexity: \(O(3^{p/3})\).


Greedy mode

A fast heuristic that constructs a single maximal clique via greedy selection.
Runs in \(O(p^2)\).
Does not guarantee the largest possible subset.


Exact mode

Enumerates all maximal cliques using ELS or Bron–Kerbosch.
Identifies the maximum (largest) valid subset.


Auto mode

Chooses the method automatically:

This balances optimality with computational cost.

Key Points — Terminology

  • Association matrix: Symmetric matrix of pairwise relationships (correlations, Cramér’s V, etc.)
  • Threshold (τ): Cutoff determining which pairs are “too associated” to coexist
  • Valid subset: All pairwise associations below τ
  • Maximal subset: Valid subset that cannot be enlarged
  • Threshold graph: Graph where edges connect compatible (low-association) variable pairs
  • Clique: Fully connected subgraph; maximal cliques = maximal valid subsets

Intuitive Overview

Before diving into formal definitions, let’s build intuition with a simple conceptual overview.

The Core Idea

Imagine you have a dataset with many predictors, some of which are highly correlated. For example:

When building statistical models, including highly correlated predictors creates problems:

  1. Coefficient instability: Small data changes cause large coefficient swings
  2. Inflated variance: Standard errors become unreliable
  3. Interpretability issues: Hard to isolate individual predictor effects

The solution: remove redundant predictors while keeping as many variables as possible.

How corrselect Works

corrselect transforms this statistical problem into a graph problem:

  1. Represent variables as nodes in a graph
  2. Draw edges between compatible variables (correlation below threshold τ)
  3. Find maximal groups where all nodes are connected (maximal cliques)

Each maximal clique represents a valid subset: a group of variables where every pair has correlation below τ.

Why “Maximal” Not “Maximum”?

A maximal subset cannot be extended by adding more variables—it’s locally complete.

A maximum subset is the single largest possible subset—globally optimal.

corrselect finds all maximal subsets because:

Toy Example (4 Variables)

Consider 4 variables with this correlation matrix:

# Construct a simple 4x4 correlation matrix
cor_4var <- matrix(c(
  1.00, 0.85, 0.10, 0.15,
  0.85, 1.00, 0.12, 0.18,
  0.10, 0.12, 1.00, 0.75,
  0.15, 0.18, 0.75, 1.00
), nrow = 4, byrow = TRUE)

colnames(cor_4var) <- rownames(cor_4var) <- paste0("V", 1:4)

# Display matrix
print(cor_4var)
#>      V1   V2   V3   V4
#> V1 1.00 0.85 0.10 0.15
#> V2 0.85 1.00 0.12 0.18
#> V3 0.10 0.12 1.00 0.75
#> V4 0.15 0.18 0.75 1.00

Observations:

Set threshold τ = 0.7. Which pairs violate the threshold?

Graph Representation

Now we build the threshold graph where edges connect compatible variables (correlation < 0.7).

Text representation:

Variables: V1, V2, V3, V4

Edges (|correlation| < 0.7):
  V1 —— V3  (cor = 0.10)
  V1 —— V4  (cor = 0.15)
  V2 —— V3  (cor = 0.12)
  V2 —— V4  (cor = 0.18)

Missing edges (|correlation| ≥ 0.7):
  V1 ⨯ V2  (cor = 0.85, too high)
  V3 ⨯ V4  (cor = 0.75, too high)

Maximal cliques (maximal variable subsets):
  {V1, V3}: Both connected, cannot add V2 or V4
  {V1, V4}: Both connected, cannot add V2 or V3
  {V2, V3}: Both connected, cannot add V1 or V4
  {V2, V4}: Both connected, cannot add V1 or V3

Let’s verify this with code:

# Adjacency matrix for threshold graph (edges where |cor| < 0.7)
adj_matrix <- abs(cor_4var) < 0.7
diag(adj_matrix) <- FALSE  # No self-loops

# Visualize as adjacency matrix
cat("Threshold graph edges (1 = edge exists):\n")
#> Threshold graph edges (1 = edge exists):
print(adj_matrix * 1)
#>    V1 V2 V3 V4
#> V1  0  0  1  1
#> V2  0  0  1  1
#> V3  1  1  0  0
#> V4  1  1  0  0

Interpretation: An edge exists between Vi and Vj if they can coexist in a valid subset.

Visual Graph Representation

Let’s visualize this threshold graph with nodes and edges:

# Node positions (arranged in a square for clarity)
node_pos <- matrix(c(
  0, 1,    # V1 (top-left)
  2, 1,    # V2 (top-right)
  0, 0,    # V3 (bottom-left)
  2, 0     # V4 (bottom-right)
), ncol = 2, byrow = TRUE)

# Plot setup
par(mar = c(2, 2, 3, 2))
plot(node_pos, type = "n", xlim = c(-0.5, 2.5), ylim = c(-0.5, 1.5),
     xlab = "", ylab = "", axes = FALSE,
     main = "Threshold Graph (τ = 0.7)\nEdges connect variables with |correlation| < 0.7")

# Draw edges (where correlation < 0.7)
edge_color <- rgb(0.2, 0.5, 0.8, 0.6)
edge_lwd <- 2

for (i in 1:4) {
  for (j in 1:4) {
    if (i < j && adj_matrix[i, j]) {
      segments(node_pos[i, 1], node_pos[i, 2],
               node_pos[j, 1], node_pos[j, 2],
               col = edge_color, lwd = edge_lwd)
    }
  }
}

# Draw nodes
node_size <- 0.15
for (i in 1:4) {
  # Node circle
  symbols(node_pos[i, 1], node_pos[i, 2],
          circles = node_size, add = TRUE,
          inches = FALSE, bg = "white", fg = "black", lwd = 2)

  # Node label
  text(node_pos[i, 1], node_pos[i, 2],
       labels = paste0("V", i), cex = 1.2, font = 2)
}

# Add correlation annotations
text(1, 1.35, "cor = 0.85\n(too high, no edge)", cex = 0.8, col = "red")
text(1, -0.35, "cor = 0.75\n(too high, no edge)", cex = 0.8, col = "red")

# Legend showing maximal cliques
legend("right",
       legend = c("Maximal cliques:", "{V1, V3}", "{V1, V4}", "{V2, V3}", "{V2, V4}"),
       bty = "o", bg = "white", cex = 0.9,
       pch = c(NA, 19, 19, 19, 19),
       col = c(NA, "black", "black", "black", "black"))

# Add box around graph
box()

Graph visualization with 4 nodes (V1, V2, V3, V4) arranged in a square pattern. Blue edges connect variable pairs with absolute correlation below 0.7 threshold. Red labels on edges show actual correlation values. Large red circles highlight nodes, with white labels. The graph structure reveals two maximal cliques: {V1,V3} and {V2,V4}, corresponding to maximal subsets where all pairwise correlations remain below threshold.

Graph interpretation:

Maximal cliques (groups where everyone connects to everyone):

  1. {V1, V3}: V1—V3 edge exists ✓, cannot add V2 (no V1—V2 edge) or V4 (no V3—V4 edge)
  2. {V1, V4}: V1—V4 edge exists ✓, cannot add V2 (no V1—V2 edge) or V3 (no V3—V4 edge)
  3. {V2, V3}: V2—V3 edge exists ✓, cannot add V1 (no V1—V2 edge) or V4 (no V3—V4 edge)
  4. {V2, V4}: V2—V4 edge exists ✓, cannot add V1 (no V1—V2 edge) or V3 (no V3—V4 edge)

This visual representation makes the graph-theoretic formulation concrete: finding maximal valid variable subsets is equivalent to finding maximal cliques in the threshold graph.

Network Visualization with cor_example

For larger matrices, we can use igraph to visualize the threshold graph structure. Let’s demonstrate with cor_example, which has 20 variables with known block structure:

data(cor_example)

# Build threshold graph (edges where |correlation| < 0.7)
threshold <- 0.7
adj_mat <- abs(cor_example) < threshold
diag(adj_mat) <- FALSE

if (requireNamespace("igraph", quietly = TRUE)) {
  library(igraph)

  # Create graph from adjacency matrix
  g <- graph_from_adjacency_matrix(adj_mat, mode = "undirected")

  # Find maximal cliques
  cliques <- max_cliques(g)
  cat(sprintf("Found %d maximal cliques at threshold %.1f\n", length(cliques), threshold))

  # Color nodes by which block they belong to
  block_colors <- c(rep("#d73027", 5),   # Block 1 (V1-V5): high correlation
                    rep("#fc8d59", 5),   # Block 2 (V6-V10): moderate
                    rep("#91bfdb", 5),   # Block 3 (V11-V15): low
                    rep("#4575b4", 5))   # Block 4 (V16-V20): minimal

  # Plot network
  par(mar = c(1, 1, 3, 1))
  plot(g,
       vertex.size = 10,
       vertex.color = block_colors,
       vertex.label.cex = 0.8,
       vertex.label.color = "black",
       edge.color = rgb(0.5, 0.5, 0.5, 0.3),
       edge.width = 1,
       layout = layout_with_fr(g),
       main = sprintf("Threshold Graph (τ = %.1f): Variables with |cor| < %.1f are connected",
                     threshold, threshold))

  # Add legend
  legend("topleft",
         legend = c("Block 1 (V1-V5): High cor",
                   "Block 2 (V6-V10): Moderate cor",
                   "Block 3 (V11-V15): Low cor",
                   "Block 4 (V16-V20): Minimal cor"),
         fill = c("#d73027", "#fc8d59", "#91bfdb", "#4575b4"),
         bty = "o", bg = "white", cex = 0.8)
} else {
  cat("Install igraph for network visualization: install.packages('igraph')\n")
  cat("Adjacency matrix (first 5×5 block):\n")
  print(adj_mat[1:5, 1:5] * 1)
}
#> 
#> Attaching package: 'igraph'
#> The following objects are masked from 'package:stats':
#> 
#>     decompose, spectrum
#> The following object is masked from 'package:base':
#> 
#>     union
#> Found 5 maximal cliques at threshold 0.7

Network graph visualization of 20 variables organized into 4 correlation blocks. Nodes are colored by block: red (Block 1, V1-V5, high correlation), orange (Block 2, V6-V10, moderate), light blue (Block 3, V11-V15, low), and dark blue (Block 4, V16-V20, minimal). Gray edges connect variables with absolute correlation below 0.7 threshold. The force-directed layout clusters highly correlated variables together, revealing the block structure. Variables within blocks have few connections (high correlation), while variables across blocks have many connections (low correlation), illustrating which combinations can form maximal cliques.

Interpretation:

Finding Maximal Cliques

A clique is a group where every pair is connected. In our graph:

Potential cliques:

Maximal cliques: Can any 2-variable clique be extended?

So there are 4 maximal cliques of size 2, each representing a valid variable subset.

Let corrselect confirm this:

results <- MatSelect(cor_4var, threshold = 0.7, method = "bron-kerbosch")
show(results)
#> CorrCombo object
#> -----------------
#>   Method:      bron-kerbosch
#>   Threshold:   0.700
#>   Subsets:     4 maximal subsets
#>   Data Rows:   4 used in correlation
#>   Pivot:       TRUE
#> 
#> Top combinations:
#>   No.  Variables                          Avg    Max    Size
#>   ------------------------------------------------------------
#>   [ 1] V1, V3                            0.100  0.100     2
#>   [ 2] V2, V3                            0.120  0.120     2
#>   [ 3] V1, V4                            0.150  0.150     2
#>   [ 4] V2, V4                            0.180  0.180     2

Result interpretation:

Key Insight

This toy example shows why corrselect enumerates all solutions:

Real datasets have similar structure but with more variables and more complex clustering.

Key Points — Intuitive Overview

  • corrselect finds variable subsets where no pair exceeds a correlation threshold
  • Multiple valid subsets typically exist; all are enumerated for user choice
  • Graph representation: variables = nodes, low-correlation pairs = edges
  • Maximal cliques in this graph = maximal valid subsets
  • Comparing subsets reveals correlation structure (clusters of related variables)

Problem Formulation

Intuitive Problem Statement

The core problem is straightforward: given a set of \(p\) variables with known pairwise associations (correlations), identify all largest possible subsets where every pair of variables has an association below a user-defined threshold \(\tau\).

Think of this as a social network where variables are people and edges represent “get along well” (low correlation). We want to find all maximal groups of people where everyone gets along with everyone else in the group. A group is “maximal” if we cannot add any more people without introducing a conflict.

Some variables may be designated as “must include” (forced-in). In the social network analogy, these are VIPs who must be in every group, so we only search for groups containing all VIPs.

Formal Problem Statement

Input:

Constraints:

A subset \(S \subseteq \{1, \dots, p\}\) is valid if:

  1. \(F \subseteq S\) (if \(F\) specified)
  2. \(|a_{ij}| < \tau\) for all \(i, j \in S\) with \(i \neq j\)

Objective:

Find all maximal valid subsets \(\mathcal{S} = \{S_1, \dots, S_m\}\) where \(S_k\) is maximal if no variable \(v \notin S_k\) satisfies \(|a_{vi}| < \tau\) for all \(i \in S_k\).

Output:

Collection \(\mathcal{S}\) containing all maximal valid subsets.


Association Matrix

Given \(p\) variables, compute an association matrix \(A \in \mathbb{R}^{p \times p}\) where:

\[ a_{ij} = \text{association}(X_i, X_j) \]

For numeric variables, \(a_{ij}\) may be a correlation coefficient (Pearson, Spearman, Kendall, etc.). For mixed-type variables, \(a_{ij}\) is chosen based on variable types:

Type \((X_i, X_j)\) Measure
numeric, numeric Pearson/Spearman/Kendall
numeric, factor \(\eta^2\)
numeric, ordered Spearman/Kendall
factor, factor Cramér’s V
factor, ordered Cramér’s V
ordered, ordered Spearman/Kendall

All measures are bounded: \(a_{ij} \in [0, 1]\) or \(a_{ij} \in [-1, 1]\).

Threshold Constraint

Fix a threshold \(\tau \in (0, 1)\). A subset \(S \subseteq \{1, \dots, p\}\) is valid if:

\[ \forall i, j \in S,\ i \neq j: \quad |a_{ij}| < \tau \]

Maximal Valid Subsets

A valid subset \(S\) is maximal if no variable \(k \notin S\) satisfies:

\[ |a_{ki}| < \tau \quad \text{for all } i \in S \]

Key Points — Problem Formulation

  • Input: p × p association matrix A, threshold τ ∈ (0,1)
  • Valid subset: S where |a_ij| < τ for all pairs i,j ∈ S
  • Maximal: cannot add any variable without violating threshold
  • Goal: enumerate all maximal valid subsets

Graph-Theoretic Interpretation

Why Graphs?

The variable selection problem has a natural graph representation that connects it to decades of research in computational graph theory. By viewing variables as nodes and “compatible pairs” (low correlation) as edges, we transform the statistical problem into a well-studied graph problem with proven algorithms.

This representation is powerful because:

  1. Efficiency: Graph algorithms exploit structural properties (sparsity, degeneracy) for faster computation
  2. Exactness: We can enumerate all solutions, not just find one
  3. Forced variables: Graph algorithms naturally handle constraints (forced-in sets)

The key insight: a group of mutually compatible variables (valid subset) is exactly a clique in a graph where edges represent compatibility.

Threshold Graph

Define the threshold graph \(G = (V, E)\) where:

Note the reversal: An edge \((i, j)\) means variables \(i\) and \(j\) have low correlation (can coexist). This is the complement of the typical “correlation graph” where edges represent high correlation.

Maximal Cliques

A valid subset \(S\) corresponds to a clique in \(G\): all pairs in \(S\) are connected.

A maximal valid subset corresponds to a maximal clique: a clique that cannot be extended.

Finding all maximal valid subsets is equivalent to enumerating all maximal cliques in \(G\).

Example: 6-Variable Threshold Graph

Consider 6 variables with correlation matrix:

# Create example correlation matrix
set.seed(123)
cor_6var <- matrix(c(
  1.00, 0.85, 0.75, 0.20, 0.15, 0.10,
  0.85, 1.00, 0.80, 0.25, 0.20, 0.15,
  0.75, 0.80, 1.00, 0.30, 0.25, 0.20,
  0.20, 0.25, 0.30, 1.00, 0.65, 0.55,
  0.15, 0.20, 0.25, 0.65, 1.00, 0.60,
  0.10, 0.15, 0.20, 0.55, 0.60, 1.00
), nrow = 6, byrow = TRUE)

rownames(cor_6var) <- colnames(cor_6var) <- paste0("V", 1:6)

# Display correlation matrix
print(round(cor_6var, 2))
#>      V1   V2   V3   V4   V5   V6
#> V1 1.00 0.85 0.75 0.20 0.15 0.10
#> V2 0.85 1.00 0.80 0.25 0.20 0.15
#> V3 0.75 0.80 1.00 0.30 0.25 0.20
#> V4 0.20 0.25 0.30 1.00 0.65 0.55
#> V5 0.15 0.20 0.25 0.65 1.00 0.60
#> V6 0.10 0.15 0.20 0.55 0.60 1.00

Threshold graph construction with \(\tau = 0.7\):

# Build adjacency matrix for threshold graph
tau <- 0.7
adj_matrix <- abs(cor_6var) < tau
diag(adj_matrix) <- FALSE

# Visualize correlation structure and threshold graph
par(mfrow = c(1, 2))

# Panel 1: Correlation heatmap
col_pal <- colorRampPalette(c("#3B4992", "white", "#EE0000"))(100)
image(1:6, 1:6, t(cor_6var[6:1, ]),
      col = col_pal,
      xlab = "", ylab = "",
      main = "Correlation Matrix",
      axes = FALSE,
      zlim = c(-1, 1))
axis(1, at = 1:6, labels = colnames(cor_6var), cex.axis = 0.8)
axis(2, at = 6:1, labels = colnames(cor_6var), cex.axis = 0.8)

# Add correlation values
for (i in 1:6) {
  for (j in 1:6) {
    col_text <- if (abs(cor_6var[j, i]) > 0.5) "white" else "black"
    text(i, 7 - j, sprintf("%.2f", cor_6var[j, i]),
         cex = 0.7, col = col_text)
  }
}
abline(h = 3.5, lwd = 2, lty = 2, col = "black")
abline(v = 3.5, lwd = 2, lty = 2, col = "black")

# Panel 2: Threshold graph (edges where |cor| < tau)
plot.new()
plot.window(xlim = c(0, 1), ylim = c(0, 1))
title(main = sprintf("Threshold Graph (τ = %.1f)", tau))

# Node positions (2 groups based on correlation structure)
pos <- matrix(c(
  0.2, 0.8,  # V1
  0.3, 0.6,  # V2
  0.1, 0.4,  # V3
  0.7, 0.8,  # V4
  0.8, 0.5,  # V5
  0.9, 0.3   # V6
), ncol = 2, byrow = TRUE)

# Draw edges (compatible variable pairs)
for (i in 1:5) {
  for (j in (i + 1):6) {
    if (adj_matrix[i, j]) {
      lines(c(pos[i, 1], pos[j, 1]), c(pos[i, 2], pos[j, 2]),
            col = "gray70", lwd = 1.5)
    }
  }
}

# Draw nodes
points(pos[, 1], pos[, 2], pch = 21, cex = 4,
       bg = c(rep(rgb(0.8, 0.2, 0.2, 0.7), 3),
              rep(rgb(0.2, 0.5, 0.8, 0.7), 3)),
       col = "black", lwd = 2)

# Add labels
text(pos[, 1], pos[, 2], labels = colnames(cor_6var),
     col = "white", font = 2, cex = 0.8)

# Add legend
legend("bottomleft",
       legend = c("Clique 1 (V1-V3)", "Clique 2 (V4-V6)", "Edge: |cor| < 0.7"),
       pch = c(21, 21, NA),
       pt.bg = c(rgb(0.8, 0.2, 0.2, 0.7), rgb(0.2, 0.5, 0.8, 0.7), NA),
       pt.cex = 2,
       lty = c(NA, NA, 1),
       col = c("black", "black", "gray70"),
       lwd = c(2, 2, 1.5),
       bty = "o",
       bg = "white",
       cex = 0.7)

Two side-by-side visualizations for 6-variable example. Left panel: correlation matrix heatmap with blue (negative), white (zero), and red (positive) colors, numerical values overlaid, and black dashed lines separating two correlation blocks. Right panel: threshold graph showing 6 nodes positioned in two groups based on correlation structure, with blue edges connecting variables where absolute correlation is below 0.7. The dual visualization demonstrates how correlation matrix structure translates to threshold graph topology, revealing maximal cliques within and across correlation blocks.


par(mfrow = c(1, 1))

Interpreting the visualization:

The left panel shows the correlation matrix with clear block structure. Variables V1-V3 are highly correlated with each other (correlations 0.75-0.85, shown in red), as are V4-V6 (correlations 0.55-0.65). Between-block correlations are low (0.10-0.30, shown in blue/white).

The right panel shows the corresponding threshold graph with \(\tau = 0.7\). An edge connects two variables if their absolute correlation is below 0.7 (compatible variables). Note that:

This graph structure immediately reveals the two maximal cliques: \(\{V1, V2, V3\}\) and \(\{V4, V5, V6\}\). No variable can be added to either clique without violating the threshold constraint.

Identify maximal cliques:

# Run MatSelect to find all maximal subsets
results <- MatSelect(cor_6var, threshold = 0.7, method = "els")
show(results)
#> CorrCombo object
#> -----------------
#>   Method:      els
#>   Threshold:   0.700
#>   Subsets:     3 maximal subsets
#>   Data Rows:   6 used in correlation
#> 
#> Top combinations:
#>   No.  Variables                          Avg    Max    Size
#>   ------------------------------------------------------------
#>   [ 1] V1, V4, V5, V6                    0.375  0.650     4
#>   [ 2] V2, V4, V5, V6                    0.400  0.650     4
#>   [ 3] V3, V4, V5, V6                    0.425  0.650     4

Interpreting the results:

MatSelect identified two maximal subsets of size 3, exactly matching the visual graph analysis. Both subsets satisfy \(|a_{ij}| < 0.7\) for all pairs within the subset.

Neither subset can be extended: adding any variable from the other group would introduce a pair with correlation above 0.7. For this symmetric example, both subsets have equal size and are equally valid solutions. In practice, you might choose based on domain knowledge (prefer variables with established theory) or downstream model performance.

Key Points — Graph-Theoretic Interpretation

  • Build threshold graph: nodes = variables, edges connect pairs with |a_ij| < τ
  • Maximal valid subsets ↔︎ maximal cliques in threshold graph
  • Proven algorithms (ELS, Bron-Kerbosch) enumerate all maximal cliques efficiently
  • Graph structure reveals variable clustering (densely connected = similar variables)

From Theory to Implementation

The mathematical concepts defined earlier map directly onto the function arguments and behavior of the package. The correspondence is outlined below.

Threshold (\(\tau\)) → threshold argument

Controls which edges appear in the threshold graph.

Maximal cliques → Returned subsets

Forced-in set (\(F\)) → force_in argument

Ensures that certain variables appear in every returned subset.

Search type → mode and method arguments

Choice of enumeration algorithm:

Association matrix (\(A\)) → Data input and matrix-based functions

How the association structure enters the algorithm:

Graph density → Performance considerations

Depends directly on the threshold:

These properties motivate the auto mode: exact for small \(p\), greedy for larger \(p\).

Example mapping

Mathematical formulation:

\[ \text{Find all maximal } S \subseteq \{1,\dots,p\} \text{ such that } |a_{ij}| < 0.7\ \forall i, j \in S,\ \text{with } \{ \text{age} \} \subseteq S. \]

Implementation:

results <- corrSelect(
  data      = mydata,
  threshold = 0.7,      # τ = 0.7
  force_in  = "age",    # F = {age}
  mode      = "exact",  # enumerate all maximal cliques
  method    = "els"     # use ELS algorithm (recommended with force_in)
)

The returned object contains all maximal cliques, each representing a valid variable subset that satisfies the threshold constraint and includes the required variables.

Key Points — From Theory to Implementation

  • threshold = τ (correlation cutoff)
  • force_in = F (variables required in all subsets)
  • mode = "exact" → enumerate all maximal cliques
  • mode = "greedy" → fast heuristic, single subset
  • method = "els" recommended when using force_in

Search Algorithms

Exact Enumeration

Two algorithms enumerate all maximal cliques exactly:

Eppstein–Löffler–Strash (ELS)

Uses degeneracy ordering to structure the search:

  1. Compute degeneracy ordering \(v_1, \dots, v_p\)
  2. For each \(v_i\), extend cliques within candidates \(\{v_{i+1}, \dots, v_p\}\)
  3. Recursively build cliques, pruning when no extension is possible

Formally, define:

\[ \text{extend}(R, P) = \begin{cases} \{R\}, & P = \emptyset \\ \bigcup_{v \in P} \text{extend}(R \cup \{v\}, P \cap N(v)), & \text{otherwise} \end{cases} \]

where \(N(v)\) denotes neighbors of \(v\) in \(G\).

Bron–Kerbosch

Classical recursive backtracking with optional pivoting.

Let \(R\) = current clique, \(P\) = candidates, \(X\) = excluded nodes:

\[ \text{BK}(R, P, X) = \begin{cases} \text{report}(R), & P = X = \emptyset \\ \text{for each } v \in P \setminus N(u): \\ \quad \text{BK}(R \cup \{v\}, P \cap N(v), X \cap N(v)) \\ \quad P \leftarrow P \setminus \{v\}, \quad X \leftarrow X \cup \{v\} \end{cases} \]

Pivot \(u \in P \cup X\) reduces recursive calls.

Pseudocode for Practitioners

Eppstein-Löffler-Strash (ELS) Algorithm:

Algorithm: ELS_MaxCliques(Graph G, ForceIn F)
Input: Threshold graph G = (V, E), forced variables F ⊆ V
Output: All maximal cliques containing F

1. Validate F forms a clique in G
2. Compute degeneracy ordering: v₁, v₂, ..., vₚ
3. Initialize results = []

4. For each vertex vᵢ in ordering:
   a. If vᵢ ∉ F and vᵢ not adjacent to all in F:
      Skip vᵢ (cannot extend F)

   b. candidates = {vⱼ : j > i and vⱼ adjacent to vᵢ} ∩ N(F)
   c. Extend(clique = {vᵢ} ∪ F, candidates, results)

5. Return results

Subroutine: Extend(R, P, results)
  If P is empty:
    Add R to results if maximal
    Return

  For each v in P:
    neighbors = P ∩ N(v)
    Extend(R ∪ {v}, neighbors, results)

Bron-Kerbosch Algorithm:

Algorithm: BronKerbosch(Graph G, use_pivot = TRUE)
Input: Threshold graph G = (V, E), pivot flag
Output: All maximal cliques

1. Initialize: R = ∅ (current clique)
              P = V (candidates)
              X = ∅ (excluded)
2. Call BK(R, P, X, results)
3. Return results

Subroutine: BK(R, P, X, results)
  If P = ∅ and X = ∅:
    Add R to results (R is maximal)
    Return

  If use_pivot:
    Choose pivot u from P ∪ X with max |P ∩ N(u)|
    iterate = P \ N(u)  # Skip neighbors of pivot
  Else:
    iterate = P

  For each vertex v in iterate:
    BK(R ∪ {v},
       P ∩ N(v),     # Only candidates adjacent to v
       X ∩ N(v),     # Only excluded adjacent to v
       results)

    P = P \ {v}      # Remove v from candidates
    X = X ∪ {v}      # Add v to excluded

Complexity:

  • ELS: \(O(d \cdot 3^{d/3})\) where \(d\) is degeneracy (much faster on sparse graphs)
  • Bron-Kerbosch: \(O(3^{p/3})\) worst case, improved with pivoting
  • Greedy (below): \(O(p^2)\) deterministic polynomial time

Greedy Heuristic

Iteratively removes variables with highest average absolute association until all pairs satisfy \(|a_{ij}| < \tau\).

Returns a single valid subset (not necessarily maximal or optimal).

Complexity: \(O(p^2)\) vs \(O(2^p)\) for exact enumeration.

Key Points — Search Algorithms

  • Exact mode: ELS or Bron-Kerbosch enumerate all maximal cliques
  • ELS: O(d · 3^{d/3}), faster on sparse graphs, better with force_in
  • Bron-Kerbosch: O(3^{p/3}), pivoting improves performance
  • Greedy: O(p²), fast but returns single (possibly non-optimal) subset
  • Rule of thumb: exact for p ≤ 20, greedy for p > 20

Algorithm Pseudocode

Eppstein–Löffler–Strash (ELS)

Input: Graph \(G = (V, E)\), forced-in set \(F\)

Output: All maximal cliques containing \(F\)

Algorithm ELS(G, F):
  # Step 1: Compute degeneracy ordering
  deg_order ← ComputeDegeneracyOrdering(G)

  # Step 2: Initialize with forced-in variables
  R ← F  (current clique)
  P ← V \ F  (candidate vertices)
  X ← ∅  (excluded vertices)

  # Step 3: Validate forced-in set
  for each i, j ∈ F:
    if (i, j) ∉ E:
      return ∅  (infeasible)

  # Step 4: Recursively enumerate maximal cliques
  return EnumerateCliques(R, P, X, deg_order)

Subroutine EnumerateCliques(R, P, X, ordering):
  # Base case: no candidates, no exclusions → maximal clique
  if P = ∅ and X = ∅:
    report R as maximal clique
    return

  # Recursive case: extend with each candidate
  for each v ∈ P (in degeneracy order):
    # Extend clique
    R' ← R ∪ {v}

    # Update candidates: keep only neighbors of v
    P' ← P ∩ N(v)

    # Update exclusions: keep only neighbors of v
    X' ← X ∩ N(v)

    # Recurse
    EnumerateCliques(R', P', X', ordering)

    # Move v from candidates to exclusions
    P ← P \ {v}
    X ← X ∪ {v}

Subroutine ComputeDegeneracyOrdering(G):
  # Degeneracy ordering: v_1, ..., v_p where each v_i has
  # minimum degree in G[{v_i, ..., v_p}]

  ordering ← [ ]
  remaining ← V

  while remaining ≠ ∅:
    # Find vertex with minimum degree in induced subgraph
    v ← argmin_{u ∈ remaining} |N(u) ∩ remaining|

    # Add to ordering (reverse order)
    ordering.prepend(v)
    remaining ← remaining \ {v}

  return ordering

Complexity: \(O(d \cdot 3^{d/3})\) where \(d\) is the degeneracy of \(G\).

Properties:


Bron–Kerbosch with Pivoting

Input: Graph \(G = (V, E)\), forced-in set \(F\)

Output: All maximal cliques containing \(F\)

Algorithm BronKerbosch(G, F):
  # Step 1: Initialize
  R ← F  (current clique)
  P ← V \ F  (candidate vertices)
  X ← ∅  (excluded vertices)

  # Step 2: Validate forced-in set
  for each i, j ∈ F:
    if (i, j) ∉ E:
      return ∅  (infeasible)

  # Step 3: Restrict candidates to neighbors of forced-in set
  if F ≠ ∅:
    P ← P ∩ (⋂_{v ∈ F} N(v))

  # Step 4: Enumerate
  return BK(R, P, X)

Subroutine BK(R, P, X):
  # Base case: no candidates, no exclusions → maximal clique
  if P = ∅ and X = ∅:
    report R as maximal clique
    return

  # Choose pivot: vertex with most neighbors in P
  u ← argmax_{v ∈ P ∪ X} |N(v) ∩ P|

  # Iterate over non-neighbors of pivot (reduces recursion)
  for each v ∈ P \ N(u):
    # Extend clique
    R' ← R ∪ {v}

    # Update candidates: keep only neighbors of v
    P' ← P ∩ N(v)

    # Update exclusions: keep only neighbors of v
    X' ← X ∩ N(v)

    # Recurse
    BK(R', P', X')

    # Move v from candidates to exclusions
    P ← P \ {v}
    X ← X ∪ {v}

Complexity: \(O(3^{p/3})\) maximal cliques (worst-case).

Properties:

Pivot selection:


Greedy Heuristic

Input: Association matrix \(A\), threshold \(\tau\), forced-in set \(F\)

Output: Single valid subset (not necessarily maximal)

Algorithm GreedyPrune(A, τ, F):
  # Step 1: Initialize with all variables
  S ← {1, ..., p}

  # Step 2: Validate forced-in set
  if F ≠ ∅:
    for each i, j ∈ F:
      if |a_ij| ≥ τ:
        return ∅  (infeasible)

  # Step 3: Iteratively remove highest-correlated variables
  while ∃ i, j ∈ S: |a_ij| ≥ τ:
    # Compute average absolute correlation for each variable
    for each v ∈ S \ F:
      avg[v] ← (1 / |S| - 1) × Σ_{u ∈ S, u ≠ v} |a_vu|

    # Remove variable with highest average correlation
    v_max ← argmax_{v ∈ S \ F} avg[v]
    S ← S \ {v_max}

  # Step 4: Return pruned subset
  return S

Complexity:
\(O(p^2 k)\), where \(k\) is the number of variables removed.

Properties:

Tie-breaking: When several variables share the same average absolute correlation:

  1. Prefer the variable with the lower median absolute correlation
  2. If still tied, prefer the lexicographically first variable name

Forced Variables

Constraint: variables \(F \subseteq \{1, \dots, p\}\) must appear in all returned subsets.

Graph Modification

Modify the search:

Formally, find maximal cliques in \(G\) containing \(F\).


Correlation vs Association

Correlation: \(a_{ij} \in [-1, 1]\), use \(|a_{ij}|\) in threshold constraint

Association: \(a_{ij} \in [0, 1]\), use \(a_{ij}\) directly

Mixed-type data uses association matrix with measures chosen per variable-pair type.


Complexity Analysis

Exact Enumeration

Worst-case: \(O(3^{p/3})\) maximal cliques possible

Performance depends on graph density:

Greedy Heuristic

Time: \(O(p^2 k)\) where \(k\) = iterations

Space: \(O(p^2)\) for storing associations

Deterministic: same input produces same output


Output Structure

All selection functions return a CorrCombo S4 object containing:

Results are sorted by:

  1. Subset size (descending)
  2. Average absolute association (ascending)

Design Philosophy

This section explains key design decisions underlying corrselect, addressing common questions about why certain choices were made.

Why “Maximal” Not “Maximum”?

Maximal: Cannot be extended further (locally optimal) Maximum: Largest possible size (globally optimal)

corrselect enumerates all maximal subsets rather than just finding the single maximum subset. Why?

  1. Multiple equally good solutions: Real datasets often have many maximal subsets of equal or similar size. Returning only one discards valuable information about alternative variable combinations.

  2. Domain knowledge integration: Users may prefer a slightly smaller subset containing specific variables over the globally largest subset. Having all options enables informed choice.

  3. Sensitivity analysis: Comparing multiple maximal subsets reveals structural properties of the correlation matrix (e.g., tight vs loosely connected clusters).

  4. Computational feasibility: Finding the maximum clique is NP-complete and often harder than enumerating all maximal cliques in practice. Modern maximal clique algorithms (ELS, Bron-Kerbosch) are highly efficient for typical correlation structures.

Why Hard Threshold Not Soft Constraint?

corrselect enforces a hard threshold: \(|a_{ij}| < \tau\) for all pairs. Alternative approaches use soft constraints (penalty functions, regularization). Why hard thresholds?

  1. Interpretability: “No pair exceeds \(\tau\)” is a clear, verifiable guarantee. Soft constraints produce solutions where some pairs may exceed \(\tau\) with unknown magnitude.

  2. Reproducibility: Hard thresholds produce deterministic results. Soft constraints often require tuning parameters (penalty weights, convergence criteria) that affect reproducibility.

  3. Domain-specific requirements: Fields like ecological modeling have established thresholds (e.g., \(\tau = 0.7\) for WorldClim variables) based on empirical evidence. Hard thresholds directly implement these guidelines.

  4. Exact enumeration: Hard constraints enable graph-theoretic formulation with exact algorithms. Soft constraints typically require heuristic optimization.

Why Graph Algorithms Not Optimization?

corrselect uses specialized graph algorithms (ELS, Bron-Kerbosch) rather than general optimization frameworks (integer programming, metaheuristics). Why?

  1. Asymptotic efficiency: Graph algorithms exploit structural properties (sparsity, degeneracy) unavailable to generic solvers. For sparse graphs (low \(\tau\)), this yields orders-of-magnitude speedups.

  2. Exact enumeration: Graph algorithms guarantee finding all maximal cliques. Optimization approaches typically find one solution.

  3. Forced variables: Graph algorithms naturally handle forced-in constraints via initialization. Optimization approaches require additional constraints that may degrade performance.

  4. Established theory: Maximal clique enumeration has 50+ years of algorithmic development with proven complexity bounds and implementation strategies.

Why Pairwise Associations Only?

corrselect considers only pairwise associations, not higher-order interactions (partial correlations, conditional independence). Why?

  1. Computational tractability: Higher-order interactions require exponentially more computation. For \(p\) variables, pairwise methods scale as \(O(p^2)\), while \(k\)-way interactions scale as \(O(p^k)\).

  2. Clear interpretation: “Variables \(i\) and \(j\) correlate at \(r = 0.85\)” is directly interpretable. Partial correlations require careful conditioning set selection and can be counterintuitive.

  3. Robustness: Pairwise correlations are stable with moderate sample sizes. Partial correlations require \(n \gg p\) and are sensitive to model misspecification.

  4. Method generality: Pairwise associations work for any variable type (numeric, categorical, mixed). Higher-order methods often require strong distributional assumptions.

When higher-order methods are appropriate: If your goal is causal discovery (identifying conditional independence structure), use methods like PC algorithm or constraint-based causal inference. corrselect focuses on reducing redundancy for predictive modeling and descriptive analysis.

Why Enumerate All Solutions Not Just Return One?

Many correlation-pruning tools (e.g., caret::findCorrelation) return a single subset. corrselect can return all maximal subsets (exact mode) or one (greedy mode). Why offer exhaustive enumeration?

  1. Alternative solutions: Multiple maximal subsets represent genuinely different variable combinations that all satisfy the threshold constraint. Arbitrarily choosing one discards information.

  2. Downstream analysis: Different subsets may be preferred for different models or research questions. Enumerating all options enables post-hoc selection criteria.

  3. Documentation: For reproducible research (especially JOSS/CRAN submissions), documenting all valid solutions provides transparency about algorithmic choices.

  4. Computational cost: For typical use cases (\(p \leq 30\), \(\tau \geq 0.5\)), exhaustive enumeration completes in milliseconds. When infeasible, greedy mode provides a fast approximation.

When to use greedy mode: High-dimensional data (\(p > 50\)), dense correlation structure (high \(\tau\)), or when a single solution suffices (e.g., automated pipelines).

Key Points — Design Philosophy

  • Maximal not maximum: All locally optimal subsets enumerated, not just the globally largest
  • Hard threshold: Simple, interpretable guarantee vs. soft penalty optimization
  • Graph algorithms: Leverage decades of research; proven correctness and complexity bounds
  • Pairwise only: Computational tractability, clear interpretation, robustness
  • Enumerate all: Preserves information for downstream choice; greedy available when speed critical

References

Graph-Theoretic Algorithms

Maximal clique enumeration

Graph degeneracy:

Independent sets and vertex cover:

Multicollinearity and Variable Selection

Variance inflation factor (VIF):

Correlation-based variable selection:

Association Measures

Numeric associations:

Categorical associations:

Mixed-type associations:

Threshold Graph Theory

Computational Complexity

Applications

Bioclimatic modeling:

Genomics:


See Also

Session Info

sessionInfo()
#> R version 4.5.1 (2025-06-13 ucrt)
#> Platform: x86_64-w64-mingw32/x64
#> Running under: Windows 11 x64 (build 26200)
#> 
#> Matrix products: default
#>   LAPACK version 3.12.1
#> 
#> locale:
#> [1] LC_COLLATE=C                          
#> [2] LC_CTYPE=English_United States.utf8   
#> [3] LC_MONETARY=English_United States.utf8
#> [4] LC_NUMERIC=C                          
#> [5] LC_TIME=English_United States.utf8    
#> 
#> time zone: Europe/Luxembourg
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] igraph_2.1.4         car_3.1-3            carData_3.0-5       
#> [4] microbenchmark_1.5.0 corrselect_3.0.2    
#> 
#> loaded via a namespace (and not attached):
#>  [1] tidyselect_1.2.1     timeDate_4051.111    dplyr_1.1.4         
#>  [4] farver_2.1.2         S7_0.2.0             fastmap_1.2.0       
#>  [7] pROC_1.19.0.1        caret_7.0-1          digest_0.6.37       
#> [10] rpart_4.1.24         timechange_0.3.0     lifecycle_1.0.4     
#> [13] survival_3.8-3       magrittr_2.0.4       compiler_4.5.1      
#> [16] rlang_1.1.6          sass_0.4.10          tools_4.5.1         
#> [19] yaml_2.3.10          Boruta_9.0.0         data.table_1.17.8   
#> [22] knitr_1.50           plyr_1.8.9           RColorBrewer_1.1-3  
#> [25] abind_1.4-8          withr_3.0.2          purrr_1.2.0         
#> [28] nnet_7.3-20          grid_4.5.1           stats4_4.5.1        
#> [31] future_1.67.0        ggplot2_4.0.0        globals_0.18.0      
#> [34] scales_1.4.0         iterators_1.0.14     MASS_7.3-65         
#> [37] cli_3.6.5            rmarkdown_2.30       generics_0.1.4      
#> [40] rstudioapi_0.17.1    future.apply_1.20.0  reshape2_1.4.4      
#> [43] cachem_1.1.0         stringr_1.5.2        splines_4.5.1       
#> [46] parallel_4.5.1       vctrs_0.6.5          hardhat_1.4.2       
#> [49] glmnet_4.1-10        Matrix_1.7-4         jsonlite_2.0.0      
#> [52] Formula_1.2-5        listenv_0.9.1        systemfonts_1.3.1   
#> [55] foreach_1.5.2        gower_1.0.2          jquerylib_0.1.4     
#> [58] recipes_1.3.1        glue_1.8.0           parallelly_1.45.1   
#> [61] codetools_0.2-20     lubridate_1.9.4      stringi_1.8.7       
#> [64] gtable_0.3.6         shape_1.4.6.1        tibble_3.3.0        
#> [67] pillar_1.11.1        htmltools_0.5.8.1    ipred_0.9-15        
#> [70] lava_1.8.2           R6_2.6.1             textshaping_1.0.3   
#> [73] evaluate_1.0.5       lattice_0.22-7       bslib_0.9.0         
#> [76] class_7.3-23         Rcpp_1.1.0           svglite_2.2.2       
#> [79] nlme_3.1-168         prodlim_2025.04.28   ranger_0.17.0       
#> [82] xfun_0.53            ModelMetrics_1.2.2.2 pkgconfig_2.0.3