0.1 Single-Cell RNA-Seq Analysis with XYomics
1 Introduction
2 Data Simulation and Preprocessing
3 Differential Expression Analysis Strategies
- 3.1 Method 1: Sex-Stratified Analysis
- 3.2 Method 2: Sex-Phenotype Interaction Analysis
4 Visualization of Gene Expression
5 Pathway Enrichment Analysis
6 Protein-protein interaction Network Analysis
- 6.1 Building the Network
- 6.2 Visualizing the Network
7 Generating a Report

0.1 Single-Cell RNA-Seq Analysis with XYomics

Authors: Sophie Le Bars, Mohamed Soudy, and Enrico Glaab

1 Introduction

This vignette provides a guide for using the XYomics package to analyze sex-related patterns in single-cell RNA-sequencing (scRNA-seq) data. We will walk through a complete workflow, from data simulation to network analysis, demonstrating how to identify and interpret sex-specific gene expression changes at the cell-type level.

The tutorial covers: 1. Simulating a scRNA-seq dataset. 2. Standard preprocessing using the Seurat package. 3. Performing differential expression analysis using different strategies: sex-stratified analysis and interaction analysis. 4. Detecting and categorizing sex-specific DEGs for each cell type. 5. Visualizing gene expression using dot plots. 6. Performing pathway enrichment analysis to uncover biological functions. 7. Constructing and visualizing protein-protein networks.

2 Data Simulation and Preprocessing

2.1 Simulating Single-Cell Data

We start by simulating a scRNA-seq dataset containing 10 mock samples. This creates a Seurat object with a count matrix ready for analysis.

library(Seurat)
library(org.Hs.eg.db)
library(stringr)
library(clusterProfiler)
library(igraph)
library(XYomics)
library(ggrepel)
library(ggraph)
library(dplyr)
library(tidyr)

set.seed(123)
gene_sample <- keys(org.Hs.eg.db, keytype = "SYMBOL") %>% 
  as.data.frame() %>% 
  setNames("genes") %>%
  filter(!str_starts(genes, "LOC")) 

gene_sample <- sample(gene_sample$genes, 300, replace = FALSE)

matrices <- lapply(1:10, function(i) {
  batch_effect <- rgamma(300, shape = 2, scale = 1.5)
  m <- matrix(
    ifelse(runif(300*100) < 0.1, 0, rnbinom(300*100, size = 1/0.5, mu = 10 * batch_effect)),
    nrow = 300,
    dimnames = list(gene_sample, paste("Sample", i, ": Cell", 1:100))
  )
  CreateSeuratObject(counts = m, meta.data = data.frame(sample = paste("Sample", i)))
})

sim.seurat <- Reduce(merge, matrices)
sim.seurat@meta.data$sample <- sapply(strsplit(rownames(sim.seurat@meta.data), ":"), "[", 1)

2.2 Preprocessing and Clustering

We perform standard scRNA-seq preprocessing, including normalization, feature selection, and scaling. We then apply dimensionality reduction (PCA and UMAP) and clustering to identify cell populations.

sim.seurat <- NormalizeData(sim.seurat)
sim.seurat <- FindVariableFeatures(sim.seurat, selection.method = "vst", nfeatures = 2000)
sim.seurat <- ScaleData(sim.seurat)
sim.seurat <- RunPCA(sim.seurat)
sim.seurat <- FindNeighbors(sim.seurat, dims = 1:10)
sim.seurat <- FindClusters(sim.seurat)
sim.seurat <- RunUMAP(sim.seurat, dims = 1:10)
DimPlot(sim.seurat)

2.3 Annotating Cell Types and Conditions

For the analysis, the Seurat object must contain metadata columns for cell type (cell_type), sex (sex), and condition (status). Here, we add mock annotations to our simulated data.

# Assign mock cell types
cellTypes <- c("cell type 1", "cell type 2", "cell type 3", "cell type 4", "cell type 5")
sim.seurat@meta.data$cell_type <- sample(cellTypes, nrow(sim.seurat@meta.data), replace = TRUE)
Idents(sim.seurat) <- "cell_type"

# Assign mock status (WT/TG) and sex (M/F)
samples <- sim.seurat@meta.data$sample
sim.seurat@meta.data$status <- ifelse(grepl("1|3|5|8|9", samples), "TG", "WT")
sim.seurat@meta.data$sex <- ifelse(grepl("1|2|4|8|10", samples), "M", "F")

3 Differential Expression Analysis Strategies

When analyzing sex differences, researchers must be aware of the pitfalls of associated statistical analyses, including the limitations of sex-stratified analyses and the challenges of analyzing interactions between sex and disease state.

Sex-stratified analyses use standard statistical tests for differential molecular abundance analysis to test for disease-associated changes in each sex separately. However, a pure sex-stratified analysis may misclassify a change as sex-specific if it uses a standard significance threshold to assess both the presence and absence of an effect. Stochastic variation in significance scores around a chosen threshold may lead to the erroneous detection of significance specific to only one sex, especially if the p-value in the other sex marginally exceeds the chosen threshold. In addition, such an analysis may miss sex-modulated changes, where significant changes in both sexes share the same direction but differ significantly in magnitude; these changes require cross-sex comparisons for accurate detection.

Interaction analysis (using a multiplicative term like sex*disease) formally tests whether the relationship between disease and molecular changes differs significantly between males and females. Not only can such interaction terms reveal complexities in disease mechanisms that might otherwise be obscured in analyses that do not consider SABV, but they also have the potential to detect changes that are limited to the magnitude of an effect, an aspect that sex-stratified analyses do not capture. Nevertheless, robust estimation of interaction effects requires large sample sizes, which are often not available due to the costs associated with advanced molecular profiling techniques such as single-cell RNA sequencing.

The XYomics package provides functions for both types of analyses.

3.1 Method 1: Sex-Stratified Analysis

We use sex_stratified_analysis_sc() to identify DEGs between conditions separately for males and females within each cell type.

# Run for all cell types
sim.seurat <- JoinLayers(sim.seurat) #this is for the simulated object but not always required for your own object
results <- sex_stratified_analysis_sc(sim.seurat, min_logfc = -Inf)

sex_degs <- lapply(levels(as.factor(sim.seurat@meta.data$cell_type)), function(cell) {
  list(
    male = results$male_DEGs[[cell]],
    female = results$female_DEGs[[cell]]
  )
})

names(sex_degs) <- levels(as.factor(sim.seurat@meta.data$cell_type))
result_categories <- lapply(sex_degs, function(degs) categorize_sex_sc(degs$male, degs$female))

# Example for one cell type
result_one <- result_categories$`cell type 1`
cat("\nTop categorized DEGs for 'cell type 1' (from stratified analysis):\n")

## 
## Top categorized DEGs for 'cell type 1' (from stratified analysis):

head(result_one)

##                     DEG_Type Gene_Symbols Male_avg_logFC     Male_FDR
## male-specific1 male-specific         CTSH      -1.965567 1.139671e-06
## male-specific2 male-specific    GOLGA6L24      -1.938910 5.096954e-06
## male-specific3 male-specific       TRAV41       2.146682 7.169862e-06
## male-specific4 male-specific         FAN1      -2.102872 8.528492e-06
## male-specific5 male-specific         CUTC       2.173129 1.440704e-05
## male-specific6 male-specific      SCML2P1       1.785135 4.082302e-05
##                Female_avg_logFC Female_FDR
## male-specific1       0.22074818  0.7717056
## male-specific2       0.35324695  0.9197194
## male-specific3       0.55890099  0.8394288
## male-specific4      -0.09405884  0.6653794
## male-specific5      -0.21954437  0.7529602
## male-specific6       0.19363191  0.9925615

3.2 Method 2: Sex-Phenotype Interaction Analysis

We use sex_interaction_analysis_sc() to perform a formal interaction analysis, directly comparing the condition effect between sexes.

# Example for one cell type (e.g., "cell type 1")
target_cell <- "cell type 1"
interaction_results_one_cell <- sex_interaction_analysis_sc(sim.seurat, target_cell_type = target_cell)
cat(paste0("\nSummary of Sex-Phenotype Interaction analysis results for '", target_cell, "':\n"))

## 
## Summary of Sex-Phenotype Interaction analysis results for 'cell type 1':

print(interaction_results_one_cell$summary_stats)

##     cell_type n_total_genes n_sig_genes
## 1 cell type 1           300          86

cat(paste0("\nTop genes from Sex-Phenotype Interaction analysis for '", target_cell, "':\n"))

## 
## Top genes from Sex-Phenotype Interaction analysis for 'cell type 1':

print(head(interaction_results_one_cell$all_results[[1]]))

##                logFC  AveExpr         t      P.Value    adj.P.Val         B
## CACYBP      3.695025 10.29579  6.019264 4.235834e-09 5.428263e-07 10.427515
## CSF2        3.887220 10.07678  6.017174 4.285904e-09 5.428263e-07 10.416950
## PLA2G2E    -3.571534 10.30650 -5.975030 5.428263e-09 5.428263e-07 10.194008
## SPATA31D2P -3.569100 10.69454 -5.867508 9.861416e-09 7.396062e-07  9.631411
## GSX1       -3.408116 10.52532 -5.495808 7.273736e-08 4.001010e-06  7.754312
## TMEM116    -3.371443 10.20957 -5.477563 8.002019e-08 4.001010e-06  7.664884
##              cell_type       gene
## CACYBP     cell type 1     CACYBP
## CSF2       cell type 1       CSF2
## PLA2G2E    cell type 1    PLA2G2E
## SPATA31D2P cell type 1 SPATA31D2P
## GSX1       cell type 1       GSX1
## TMEM116    cell type 1    TMEM116

4 Visualization of Gene Expression

Dot plots are an effective way to visualize gene expression in scRNA-seq data. Here, we show the expression of the top male-specific, female-specific, and sex-dimorphic genes across all cell types, split by sex and condition.

# Get top gene from each category for 'cell type 1'
top_male_gene <- result_one %>% filter(DEG_Type == "male-specific") %>% top_n(1, -Male_FDR) %>% pull(Gene_Symbols)
top_female_gene <- result_one %>% filter(DEG_Type == "female-specific") %>% top_n(1, -Female_FDR) %>% pull(Gene_Symbols)
top_dimorphic_gene <- result_one %>% filter(DEG_Type == "sex-dimorphic") %>% top_n(1, -Male_FDR) %>% pull(Gene_Symbols)

top_genes <- c(top_male_gene, top_female_gene, top_dimorphic_gene)

# Create dot plot
if (length(top_genes) > 0) {
  sim.seurat$group_plot <- paste(sim.seurat$cell_type, sim.seurat$sex, sim.seurat$status, sep = "_")
  Idents(sim.seurat) <- "group_plot"
  # Seurat v5 requires specifying assay for DotPlot
  DotPlot(sim.seurat, features = top_genes, cols = c("blue", "red"), assay = "RNA") +
    coord_flip() +
    labs(title = "Expression of Top Sex-Specific and Dimorphic Genes") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
}

5 Pathway Enrichment Analysis

We perform pathway enrichment analysis to identify biological pathways over-represented in our DEG categories. The categorized_enrich_sc function automates this for all categories.

# Run for all cell types
pathway_category <- lapply(result_categories, function(cell) categorized_enrich_sc(cell))
cat("\nTop pathway results for 'cell type 1':\n")

## 
## Top pathway results for 'cell type 1':

print(head(pathway_category$`cell type 1`))

## $male_specific
##                                      category
## hsa03460       Genetic Information Processing
## hsa04142                   Cellular Processes
## hsa04210                   Cellular Processes
## hsa04514 Environmental Information Processing
##                                  subcategory       ID
## hsa03460              Replication and repair hsa03460
## hsa04142            Transport and catabolism hsa04142
## hsa04210               Cell growth and death hsa04210
## hsa04514 Signaling molecules and interaction hsa04514
##                                       Description GeneRatio  BgRatio     pvalue
## hsa03460                   Fanconi anemia pathway       1/3  55/9512 0.01724821
## hsa04142                                 Lysosome       1/3 133/9512 0.04136752
## hsa04210                                Apoptosis       1/3 137/9512 0.04259365
## hsa04514 Cell adhesion molecule (CAM) interaction       1/3 160/9512 0.04962364
##            p.adjust qvalue geneID Count
## hsa03460 0.04962364     NA  22909     1
## hsa04142 0.04962364     NA   1512     1
## hsa04210 0.04962364     NA   1512     1
## hsa04514 0.04962364     NA  80381     1
## 
## $female_specific
##  [1] category    subcategory ID          Description GeneRatio   BgRatio    
##  [7] pvalue      p.adjust    qvalue      geneID      Count      
## <0 rows> (or 0-length row.names)
## 
## $sex_dimorphic
##                    category      subcategory       ID        Description
## hsa04978 Organismal Systems Digestive system hsa04978 Mineral absorption
##          GeneRatio BgRatio      pvalue   p.adjust     qvalue   geneID Count
## hsa04978       2/7 61/9512 0.000832167 0.03661535 0.02627896 7018/491     2
## 
## $sex_neutral
##  [1] category    subcategory ID          Description GeneRatio   BgRatio    
##  [7] pvalue      p.adjust    qvalue      geneID      Count      
## <0 rows> (or 0-length row.names)

# Run for one specific cell type
pathway_category_one <- categorized_enrich_sc(result_one)
cat("\nTop pathway results for 'cell type 1' (re-run separately):\n")

## 
## Top pathway results for 'cell type 1' (re-run separately):

print(head(pathway_category_one))

## $male_specific
##                                      category
## hsa03460       Genetic Information Processing
## hsa04142                   Cellular Processes
## hsa04210                   Cellular Processes
## hsa04514 Environmental Information Processing
##                                  subcategory       ID
## hsa03460              Replication and repair hsa03460
## hsa04142            Transport and catabolism hsa04142
## hsa04210               Cell growth and death hsa04210
## hsa04514 Signaling molecules and interaction hsa04514
##                                       Description GeneRatio  BgRatio     pvalue
## hsa03460                   Fanconi anemia pathway       1/3  55/9512 0.01724821
## hsa04142                                 Lysosome       1/3 133/9512 0.04136752
## hsa04210                                Apoptosis       1/3 137/9512 0.04259365
## hsa04514 Cell adhesion molecule (CAM) interaction       1/3 160/9512 0.04962364
##            p.adjust qvalue geneID Count
## hsa03460 0.04962364     NA  22909     1
## hsa04142 0.04962364     NA   1512     1
## hsa04210 0.04962364     NA   1512     1
## hsa04514 0.04962364     NA  80381     1
## 
## $female_specific
##  [1] category    subcategory ID          Description GeneRatio   BgRatio    
##  [7] pvalue      p.adjust    qvalue      geneID      Count      
## <0 rows> (or 0-length row.names)
## 
## $sex_dimorphic
##                    category      subcategory       ID        Description
## hsa04978 Organismal Systems Digestive system hsa04978 Mineral absorption
##          GeneRatio BgRatio      pvalue   p.adjust     qvalue   geneID Count
## hsa04978       2/7 61/9512 0.000832167 0.03661535 0.02627896 7018/491     2
## 
## $sex_neutral
##  [1] category    subcategory ID          Description GeneRatio   BgRatio    
##  [7] pvalue      p.adjust    qvalue      geneID      Count      
## <0 rows> (or 0-length row.names)

6 Protein-protein interaction Network Analysis

Finally, we construct a protein-protein interaction network to explore the interactions between our DEGs.

6.1 Building the Network

We first fetch a protein-protein interaction network from the STRING database. Then, we use the PCSF algorithm via construct_ppi_pcsf to extract a relevant subnetwork based on “prizes” derived from our DEG p-values.

# Fetch STRING network (can be replaced with a custom network)
# g <- get_string_network(organism = "9606", score_threshold = 900)

# Load a pre-existing network from a file
g <- readRDS(system.file("extdata", "string_example_network.rds", package = "XYomics")) # This should load the 'g' variable

### Run for all cell type


network_results <- list()

for (cell_type in names(result_categories)) {
  
  # Extract DEG results for the current cell type
  cell_results <- result_categories[[cell_type]]
  
  # Filter DEGs by type
  male_specific       <- cell_results[cell_results$DEG_Type == "male-specific", ]
  female_specific     <- cell_results[cell_results$DEG_Type == "female-specific", ]
  sex_dimorphic       <- cell_results[cell_results$DEG_Type == "sex-dimorphic", ]
  sex_neutral         <- cell_results[cell_results$DEG_Type == "sex-neutral", ]
  
  # Convert to log-transformed prizes
  male_prizes <- -log10(male_specific$Male_FDR)
  names(male_prizes) <- male_specific$Gene_Symbols
  
  female_prizes <- -log10(female_specific$Female_FDR)
  names(female_prizes) <- female_specific$Gene_Symbols
  
  dimorphic_prizes <- -log10((sex_dimorphic$Male_FDR + sex_dimorphic$Female_FDR) / 2)
  names(dimorphic_prizes) <- sex_dimorphic$Gene_Symbols
  
  neutral_prizes <- -log10((sex_neutral$Male_FDR + sex_neutral$Female_FDR) / 2)
  names(neutral_prizes) <- sex_neutral$Gene_Symbols
  
  # Construct ppis
  male_network     <- construct_ppi_pcsf(g = g, prizes = male_prizes)
  female_network   <- construct_ppi_pcsf(g = g, prizes = female_prizes)
  dimorphic_network <- construct_ppi_pcsf(g = g, prizes = dimorphic_prizes)
  neutral_network   <- construct_ppi_pcsf(g = g, prizes = neutral_prizes)
  
  # Store all results per cell type
  network_results[[cell_type]] <- list(
    male_network      = male_network,
    female_network    = female_network,
    dimorphic_network = dimorphic_network,
    neutral_network   = neutral_network
  )
}

network_results

## $`cell type 1`
## $`cell type 1`$male_network
## IGRAPH 33db430 UNWB 0 0 -- 
## + attr: name (v/c), prize (v/n), type (v/c), weight (e/n)
## + edges from 33db430 (vertex names):
## 
## $`cell type 1`$female_network
## IGRAPH 33dc103 UNWB 0 0 -- 
## + attr: name (v/c), prize (v/n), type (v/c), weight (e/n)
## + edges from 33dc103 (vertex names):
## 
## $`cell type 1`$dimorphic_network
## IGRAPH 33dd866 UNWB 5 4 -- 
## + attr: name (v/c), prize (v/n), type (v/c), weight (e/n)
## + edges from 33dd866 (vertex names):
## [1] CACYBP--GSK3A  DVL3  --GSK3A  GSK3B --POLR1E CACYBP--GSK3B 
## 
## $`cell type 1`$neutral_network
## IGRAPH 33deb3a UNWB 0 0 -- 
## + attr: name (v/c), prize (v/n), type (v/c), weight (e/n)
## + edges from 33deb3a (vertex names):
## 
## 
## $`cell type 2`
## $`cell type 2`$male_network
## IGRAPH 33e0798 UNWB 3 2 -- 
## + attr: name (v/c), prize (v/n), type (v/c), weight (e/n)
## + edges from 33e0798 (vertex names):
## [1] CCL2--CXCR2 CCL2--CSF2 
## 
## $`cell type 2`$female_network
## IGRAPH 33e3467 UNWB 0 0 -- 
## + attr: name (v/c), prize (v/n), type (v/c), weight (e/n)
## + edges from 33e3467 (vertex names):
## 
## $`cell type 2`$dimorphic_network
## IGRAPH 33e4ec2 UNWB 3 2 -- 
## + attr: name (v/c), prize (v/n), type (v/c), weight (e/n)
## + edges from 33e4ec2 (vertex names):
## [1] FBXO43--SKP1 CACYBP--SKP1
## 
## $`cell type 2`$neutral_network
## IGRAPH 33e5d59 UNWB 0 0 -- 
## + attr: name (v/c), prize (v/n), type (v/c), weight (e/n)
## + edges from 33e5d59 (vertex names):
## 
## 
## $`cell type 3`
## $`cell type 3`$male_network
## IGRAPH 33e709c UNWB 0 0 -- 
## + attr: name (v/c), prize (v/n), type (v/c), weight (e/n)
## + edges from 33e709c (vertex names):
## 
## $`cell type 3`$female_network
## IGRAPH 33e8def UNWB 0 0 -- 
## + attr: name (v/c), prize (v/n), type (v/c), weight (e/n)
## + edges from 33e8def (vertex names):
## 
## $`cell type 3`$dimorphic_network
## IGRAPH 33eb24c UNWB 7 6 -- 
## + attr: name (v/c), prize (v/n), type (v/c), weight (e/n)
## + edges from 33eb24c (vertex names):
## [1] CACYBP--GSK3B  CACYBP--GSK3A  DVL3  --GSK3A  POLR1E--POLR2K POLR2K--SUZ12 
## [6] GSK3B --POLR1E
## 
## $`cell type 3`$neutral_network
## IGRAPH 33ed90a UNWB 0 0 -- 
## + attr: name (v/c), prize (v/n), type (v/c), weight (e/n)
## + edges from 33ed90a (vertex names):
## 
## 
## $`cell type 4`
## $`cell type 4`$male_network
## IGRAPH 33ef09d UNWB 0 0 -- 
## + attr: name (v/c), prize (v/n), type (v/c), weight (e/n)
## + edges from 33ef09d (vertex names):
## 
## $`cell type 4`$female_network
## IGRAPH 33f09e1 UNWB 0 0 -- 
## + attr: name (v/c), prize (v/n), type (v/c), weight (e/n)
## + edges from 33f09e1 (vertex names):
## 
## $`cell type 4`$dimorphic_network
## IGRAPH 33f2c27 UNWB 0 0 -- 
## + attr: name (v/c), prize (v/n), type (v/c), weight (e/n)
## + edges from 33f2c27 (vertex names):
## 
## $`cell type 4`$neutral_network
## IGRAPH 33f54b9 UNWB 0 0 -- 
## + attr: name (v/c), prize (v/n), type (v/c), weight (e/n)
## + edges from 33f54b9 (vertex names):
## 
## 
## $`cell type 5`
## $`cell type 5`$male_network
## IGRAPH 33f66ed UNWB 0 0 -- 
## + attr: name (v/c), prize (v/n), type (v/c), weight (e/n)
## + edges from 33f66ed (vertex names):
## 
## $`cell type 5`$female_network
## IGRAPH 33f7b31 UNWB 3 2 -- 
## + attr: name (v/c), prize (v/n), type (v/c), weight (e/n)
## + edges from 33f7b31 (vertex names):
## [1] SEC13--SUMO3 PREB --SEC13
## 
## $`cell type 5`$dimorphic_network
## IGRAPH 33f9a2f UNWB 3 2 -- 
## + attr: name (v/c), prize (v/n), type (v/c), weight (e/n)
## + edges from 33f9a2f (vertex names):
## [1] POLR1E--POLR2K POLR2K--SUZ12 
## 
## $`cell type 5`$neutral_network
## IGRAPH 33fb49b UNWB 0 0 -- 
## + attr: name (v/c), prize (v/n), type (v/c), weight (e/n)
## + edges from 33fb49b (vertex names):

# Example for one specific cell type

# Create prizes from dimorphic DEGs in 'cell type 1'
dimorphic_specific <- result_one[result_one$DEG_Type == "sex-dimorphic", ]
dimorphic_prizes <- -log10((dimorphic_specific$Male_FDR + dimorphic_specific$Female_FDR) / 2)
names(dimorphic_prizes) <-dimorphic_specific$Gene_Symbols

# Construct the PCSF subnetwork
dimorphic_network <- construct_ppi_pcsf(g = g, prizes = dimorphic_prizes)

6.2 Visualizing the Network

We use plot_network() to visualize the network, highlighting nodes based on their degree (i.e., significance). Each node also includes a barplot showing logFC values, blue for males and pink for females.

if (!is.null(dimorphic_network) && igraph::vcount(dimorphic_network) > 0) {
  #Generate network visualization
  plot_network(dimorphic_network, "Cell type 1", "sex-dimorphic", result_one)
} else {
  cat("No network could be constructed for this category.")
}

7 Generating a Report

The generate_cat_report function can be used to compile all results into a single HTML report.

# This command generates a comprehensive HTML report
network_results  <- lapply(network_results , function(cell) {
  Filter(Negate(is.null), cell)
})


generate_cat_report(result_categories, pathway_category, network_results)

Single Cell RNA-Seq Example