SIMMS enables intergration of molecular profiles with functional networks such as protein-protein interaction networks. This document shows how to use SIMMS from a simple use case of integrating mRNA abundance data with pathways derived protein-protein networks, through to more sophisticated use cases of integrating multiple molecular profiles such as mRNA abundance, DNA copy-number and DNA-methylation data. Note, SIMMS requires patient outcome data as dependent variable. In current implementation, it works with survival data Survival time, Survival status
. Please setup the dataset metadata directories before using SIMMS.
There are a couple of directories that one needs to setup and ensure the contents are in correct format. These directories contain:
The structure and format of these inputs is described below.
SIMMS rely on functionally or computationally derived networks in order to meaningfully integrate molecular profiles. By default, SIMMS package has a built in database of pathway derived protein-protein interaction networks extracted from (a dated version) Reactome, BioCarta and NCI-PID. This database can be accessed through various SIMMS functions using the parameter setting networks.database = "default"
. If you would like to create your own networks database for subsequent use with SIMMS, please note that the networks database is organised in two files:
pathway_based_sub_networks.txt
pathway_based_networks_flattened.txt
File pathway_based_sub_networks.txt
contains all the subnetworks. For instance a hypothetical subnetwork of five genes ERBB2, EGFR, MKI67, ESR1, PGR
with four interactions (PGR-ESR1, EGFR-ERBB2, EGFR-PGR, MKI67-ESR1)
, and another hypothetical subnetwork of three genes PDK1, AKT3 and PIK3CA with two interactions (PDK1-PIK3CA, AKT3-PIK3CA)
using Entrez Gene IDs would look like (tab-delimited):
#ID=0001 NAME=Module.1
ID=0001 5241 2099
ID=0001 1956 2064
ID=0001 1956 5241
ID=0001 4288 2099
#ID=0002 NAME=Module.2
ID=0002 5163 5290
ID=0002 10000 5290
Please note that the only Entrez IDs are supported in this file, which should match the Entrez IDs format in the molecular data files (except additional _at suffix in the molecular data files).
File pathway_based_networks_flattened.txt
is a non-redundant, without self-interactions, quick lookup table of all the pairwise interactions present in the file pathway_based_sub_networks.txt
. This should not contain the subnetwork name and ID annotations. For example, the contents of this file for the above-mentioned subnetworks would look like (tab-delimited):
GeneID1 GeneID2
5241 2099
1956 2064
1956 5241
4288 2099
5163 5290
10000 5290
Please note that only Entrez IDs are supported in this file, which should match the Entrez IDs format in the molecular data files (except additional _at suffix in the molecular data files). If you have a gene list without known interactions, and would like to try Node-only model (model 2), you can create a subnetwork with random interactions in pathway_based_sub_networks.txt
as long as all the features (genes) are present and connected. Same goes for pathway_based_networks_flattened.txt
.
The two database files are placed inside a directory you wish to call your database. For example: MyNetworkDB
, and this directory should be placed under:
SIMMS/inst/programdata/networkdb/
You would need to build and install SIMMS package again and your networks database should be available to use through relevant SIMMS’ functions with parameter setting networks.database = "MyNetworkDB"
The main input to SIMMS is molecular and clinical data. Currently tested datatypes are mRNA abundance, DNA copy-number and DNA methylation datasets. mRNA abundance and DNA methylation profiles are expected to be in continuous scale, while DNA copy-number profiles are expected to be gene copy-number calls (-2,-1,0,1,2) or (-1,0,1)
representing deletions (-ve), neutral (0), and gains/amplifications (+ve) calls. Given that SIMMS support both continuous and categorical/ordinal profiles, any datatype can be used as input. The table below shows the supported/unsupported datatypes and respective examples:
Data type | Example data | Example profile |
---|---|---|
{ x ∈ ℝ : x ≥ 0 } | 0, 2, 1.37, 7.04, 9.68 | log2 mRNA or miRNA abundance |
{ x ∈ ℝ : x ≥ 0 } | 0.1, 0.38, 0.78, 0.22, 0.98 | DNA methylation beta values |
{ x ∈ ℤ } | -2, -1, 0, 1, 2 | copy-number calls. Reference group = 0 (Neutral/Diploid) |
{ x ∈ ℤ : x ≥ 0 } | 0, 1, 2, 3 | mutation data. Reference group = 0 (Wildtype) |
{ x ∈ ℤ : x ≠ 0 } | -2, -1, 1, 2 | Unsupported due to missing reference group 0 |
{ x ∉ ℝ } | WT, Mutant, Gain, Deleted | Unsupported due to alphabets |
Molecular profiles are strictly tab-delimited, and are expected to be in feature (gene) by sample (patient) matrices. Genes should be represented using Entrez IDs followed by _at suffix. For mRNA abundance data, if you have pre-processed your data with BrainArray Entrez CDFs Entrez CDF, your data should be SIMMS compatible by default. Otherwise, please map your dataset features e.g genes to Entrez IDs followed by _at suffix (e.g 5290_at).
Clinical profiles are strictly tab-delimited, sample (patient) by annotation matrices. The annotation columns must contain survival columns Survival time, Survival status, Survival time unit
. The row names should match the column names in the molecular profiles’ matrices. These columns in clinical annotation file could be called anything as long as these are correctly identified through the metadata file described below.
Both molecular and clinical annotations are read by SIMMS through a tab-delimited metadata file called datasets.txt
. An example of the contents of this file are shown below:
dataset mRNA cna methylation annotations survstat survtime survtime.unit
Breastdata1 mRNA_abundance.txt CNA.txt DNA_methylation.txt patient_annotation.txt e.os t.os t.os.unit
Breastdata2 mRNA_abundance.txt CNA.txt DNA_methylation.txt patient_annotation.txt e.os t.os t.os.unit
Here, each row contain an entry for a dataset e.g Breastdata1 and Breastdata2, each having its own directory on the filesystem. mRNA, cna and methylation columns contain the file names of each of the these datatypes. annotations column contains the names of the annotation files for each dataset. All the datatypes and annotation files for a given dataset must be inside the dataset directory (e.g Breastdata1/). survstat, survtime, and survtime.unit columns contain the column names of Survival status, Survival time and Survival time unit
, respectively. These column names are expected to match the column names in the annotations files. annotation data file must also have a column Tumour
with possible values of Yes|No
. All analyses will be limited to those samples with Yes
in the Tumour
column. Further subsetting of molecular and annotation datasets is enable using parameter subset
in SIMMS’ functions ?derive.network.features
& ?prepare.training.validation.datasets
. For convenience sake, an example test dataset testdata
containing metadata file datasets.txt
, mRNA abundance profiles and respective clinical data is bundled with SIMMS package, and can be found under:
SIMMS/inst/programdata/testdata/
There is no need to drop your datasets inside the package, or need to rebuild the package. You can just point to your datasets through SIMMS package keeping them anywhere on your filesystem.
Lets start with a simple case. We have two mRNA abundance datasets; Breastdata1, Breastdata2
. We would like to identify prognostic biomarkers using Breastdata1 and validate on Breastdata2. Assuming these two datasets are setup correctly (as described in the previous section), a typical workflow would be:
options("warn" = -1);
# load SIMMS library
library("SIMMS");
# path of the data directory containing Breastdata1/ and Breastdata2/ subdirectories
<- SIMMS::get.program.defaults(networks.database = "test")[["test.data.dir"]];
data.directory
# path of the directory where results will be stored
<- tempdir();
output.directory
# molecular profiles to be used
<- c("mRNA");
data.types
# feature selection datasets
# name of the dataset directory containing mRNA abundance and annotation profiles of training dataset/s
<- c("Breastdata1");
feature.selection.datasets
# model training datasets, ideally same as feature selection datasets
# name of the dataset directory containing mRNA abundance and annotation profiles of training dataset/s
<- feature.selection.datasets;
training.datasets
# validation datasets
# name of the dataset directory containing mRNA abundance and annotation profiles of validation dataset/s
<- c("Breastdata2");
validation.datasets
# all the possible P value thresholds that one may consider applying to feature selection process.
# its the P value of univariate (genewise) Cox model statistics
<- c(0.5);
feature.selection.p.thresholds
# one of the P values above, to be used for subsequent analysis. Not a vector for performance reasons
<- 0.5;
feature.selection.p.threshold
# names of the learning algorithms to be used for the final multivarite model
<- c("backward", "forward", "glm", "randomforest");
learning.algorithms
# top features to be used for model selection (Backwards elimination, Forward selection, GLM, Random survival forest)
# you can try a number of different model selection runs by specifying a vector of top n features
<- c(5);
top.n.features
# truncate survival
<- 10;
truncate.survival
# calculate per feature univariate coefficients in training sets
derive.network.features(
data.directory = data.directory,
output.directory = output.directory,
data.types = data.types,
feature.selection.datasets = feature.selection.datasets,
feature.selection.p.thresholds = feature.selection.p.thresholds,
networks.database = "test", # or "default" for Reactome/BioCarta/NCI-PID
truncate.survival = truncate.survival
);
# calculate per-subnetwork scores in both training and validation sets
prepare.training.validation.datasets(
data.directory = data.directory,
output.directory = output.directory,
data.types = data.types,
p.threshold = feature.selection.p.threshold,
feature.selection.datasets = feature.selection.datasets,
datasets = c(training.datasets, validation.datasets),
networks.database = "test", # or "default" for Reactome/BioCarta/NCI-PID
truncate.survival = truncate.survival
);
# iterate over varying top n features, identify and validate survival models
for (top.n in top.n.features) {
# create classifier assessing univariate prognostic power of subnetwork modules (Train and Validate)
<- create.classifier.univariate(
ret data.directory = data.directory,
output.directory = output.directory,
feature.selection.datasets = feature.selection.datasets,
feature.selection.p.threshold = feature.selection.p.threshold,
training.datasets = training.datasets,
validation.datasets = validation.datasets,
top.n.features = top.n
);
# create a multivariate classifier (Train and Validate)
<- create.classifier.multivariate(
ret data.directory = data.directory,
output.directory = output.directory,
feature.selection.datasets = feature.selection.datasets,
feature.selection.p.threshold = feature.selection.p.threshold,
training.datasets = training.datasets,
validation.datasets = validation.datasets,
learning.algorithms = learning.algorithms,
top.n.features = top.n
);
# perform Kaplan-Meier analysis and generate plots
create.survivalplots(
data.directory = data.directory,
output.directory = output.directory,
training.datasets = training.datasets,
validation.datasets = validation.datasets,
top.n.features = top.n,
learning.algorithms = learning.algorithms,
truncate.survival = truncate.survival,
survtime.cutoffs = c(5),
main.title = FALSE,
KM.plotting.fun = "create.KM.plot",
resolution = 100
); }
Please note that a number of parameters such as output.directory, training.datasets and validations.datasets
are repeatedly passed in various methods. This may look mindless, however, this is because none of the methods return objects that are passed over to the next method/s. The rationale behind this was to keep the memory footprint to bare-minimum by avoiding large R objects passed around. This particular feature also facilitates restarting from any step after deliberate or accidental closure of R environment as everything can be read again from the filesystem. The only compromise is the data footprint on the filesystem.
The above R-script will generate two sub-directories under the output.directory
path:
output/ and graphs/
SIMMS creates two output directories; output/ and graphs/
. The contents of these directories are described below:
coxph_nodes__
$training.datasets
__datatype_$data.types
.txt coxph_edges_coef__$training.datasets
__datatype_$data.types
.txt coxph_edges_P__$training.datasets
__datatype_$data.types
.txt
Contains per feature (nodes) univariate Cox proportional hazards model results, and per interaction (gene-gene edge) Cox proportional hazards model results
top_subnets_score__TRAINING_
$training.datasets
__model_1,2,3
__PV_$feature.selection.p.thresholds
.txt
Contains per subnetwork scores and is used for subsequent feature selection. Models 1, 2 and 3 refers to Node+Interaction, Node-only and Interaction-only models. Nodel-only (Model-2) is generally the most interpretable and robust model
riskscores_uv__
all_training,all_validation
__TRAINING_$training.datasets
__model_1,2,3
__top_$top.n
.txt
Contains per sample risk scores for each subnetwork, along with Survival time and Survival status. Sample by risk score matrix. First two columns contain survival data
riskgroups_uv__
all_training,all_validation
__TRAINING_$training.datasets
__model_1,2,3
__top_$top.n
.txt
Contains per sample risk groups for each subnetwork, along with Survival time and Survival status. Sample by risk group matrix. First two columns contain survival data. Risk groups are median-dichotomised (training set) risk scores
riskscores__
all_training,all_validation
__TRAINING_$training.datasets
__backward,forward,glm,randomforest
__model_1,2,3
__top_$top.n
.txt
Contains per sample risk scores estimated by the multivariate Cox proportional hazards model, along with Survival time and Survival status
riskgroups__
all_training,all_validation
__TRAINING_$training.datasets
__backward,forward,glm,randomforest
__model_1,2,3
__top_$top.n
.txt
Contains per sample risk group (along with Survival time and Survival status) generated by median dichotomising risk scores (training set) estimated through the multivariate Cox proportional hazards model
coxph__
all_training,all_validation
__TRAINING_$training.datasets
__backward,forward,glm,randomforest
__model_1,2,3
__top_$top.n
.txt
Contains univariate Cox proportional hazards model results with risk group (derived from multivariate Cox model) as the explanatory variable
beta__TRAINING_
$training.datasets
__backward,forward
__model_1,2,3
__top_$top.n
.txt
Contains the fitted coefficients (betas) of the final model following backward elimination and forward selection (separately)
beta__TRAINING_
$training.datasets
__glm
__model_1,2,3
__top_$top.n
.txt
Contains the fitted coefficients (betas) of the final model following GLMs (LASSO, Ridge or Elastic Nets) driven feature selection
vimp__TRAINING_
$training.datasets
__randomforest
__model_1,2,3
__top_$top.n
.txt
Contains the variable importance of random forest.
OOB_error__TRAINING_
$training.datasets
__randomforest
__model_1,2,3
__top_$top.n
.txt
Contains OOB error rate against number of trees to help identify stablisation point for ‘rf.ntree’ parameter
KM_uv__
all_training,all_validation
__TRAINING_$training.datasets
__model_1,2,3
__SubnetworkName
__truncated_$truncate.survival
.png
Kaplan-Meier analysis of risk groups generated for each subnetwork through univariate Cox proportional hazards model. These will be generated if parameter plot.univariate.data
was set to TRUE
in create.survivalplots()
as default value is set to FALSE
.
KM__
all_training,all_validation
__TRAINING_$traning.datasets
__backward,forward,glm,randomforest
__model_1,2,3
__top_$top.n
__truncated_$truncate.survival
.png
Kaplan-Meier analysis of risk groups generated through multivariate Cox proportional hazards model