rsynthbio is an R package that provides a convenient interface to the Synthesize Bio API, allowing users to generate realistic gene expression data based on specified biological conditions. This package enables researchers to easily access AI-generated transcriptomic data for various modalities, including bulk RNA-seq and single-cell RNA-seq.
Alternatively, you can generate datasets with AI on our web platform.
You can install rsynthbio from CRAN with install.packages("rsynthbio").
If you want the development version, you can install it from GitHub using the remotes package:
# Install remotes first if it is not already available
if (!("remotes" %in% rownames(installed.packages()))) {
  install.packages("remotes")
}
remotes::install_github("synthesizebio/rsynthbio")
Once installed, load the package with library(rsynthbio).
Before using the Synthesize Bio API, you need to set up your API token. The package provides a secure way to handle authentication:
# Securely prompt for and store your API token
# The token will not be visible in the console
set_synthesize_token()
# You can also store the token in your system keyring for persistence
# across R sessions (requires the 'keyring' package)
set_synthesize_token(use_keyring = TRUE)
To load a stored token in a later session:
# In future sessions, load the stored token
load_synthesize_token_from_keyring()
# Check if a token is already set
has_synthesize_token()
You can obtain an API token by registering at Synthesize Bio.
For security reasons, remember to clear your token when you’re done:
# Clear token from current session
clear_synthesize_token()
# Clear token from both session and keyring
clear_synthesize_token(remove_from_keyring = TRUE)
Never hard-code your token in scripts that will be shared or committed to version control.
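One common way to keep a token out of scripts is to read it from an environment variable at run time. The sketch below uses only base R. The variable name `SYNTHBIO_TOKEN_EXAMPLE` and the helper `read_token_from_env()` are hypothetical and are not part of rsynthbio; how the package might consume a token obtained this way is not covered here.

```r
# Hypothetical helper (not part of rsynthbio): fetch a secret from an
# environment variable and fail early, with a clear message, if it is missing.
read_token_from_env <- function(var = "SYNTHBIO_TOKEN_EXAMPLE") {
  token <- Sys.getenv(var, unset = "")
  if (!nzchar(token)) {
    stop("Environment variable '", var, "' is not set.", call. = FALSE)
  }
  token
}
```

The variable can be set outside the script, for example in a user-level .Renviron file that is itself kept out of version control.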
The first step in generating gene expression data is to create a query. The package provides a sample query that you can modify.
The query consists of:
- output_modality: The type of gene expression data to generate (see get_valid_modalities)
- mode: The prediction mode (e.g., “mean estimation” or “sample generation”)
- inputs: A list of biological conditions to generate data for

We train our models with diverse multi-omics datasets. Two model modes are available today: “mean estimation” and “sample generation”.
The result will be a list of two data frames: metadata and expression.
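To make the three query components concrete, here is a hand-built list with the same shape, using only base R. It is an illustrative sketch, not the package's sample query: the `output_modality` value is a placeholder assumption (check `get_valid_modalities` for real values), and the metadata values are arbitrary examples taken from the field list in this document.

```r
# Hand-built stand-in for a query; NOT the package's sample query.
query <- list(
  output_modality = "bulk",   # placeholder value (assumption)
  mode = "mean estimation",   # one of the mode names mentioned in this document
  inputs = list(
    list(
      metadata = list(sex = "female", sample_type = "primary tissue"),
      num_samples = 5
    ),
    list(
      metadata = list(sex = "male", sample_type = "cell line"),
      num_samples = 5
    )
  )
)

# Inspect the structure: top-level fields and per-condition sample counts
names(query)
vapply(query$inputs, function(x) x$num_samples, numeric(1))
```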
You can customize the query to fit your specific research needs:
# Adjust number of samples
query$inputs[[1]]$num_samples <- 10
# Add a new condition (appended after the existing inputs)
query$inputs[[length(query$inputs) + 1]] <- list(
  metadata = list(
    sex = "male",
    sample_type = "primary tissue"
  ),
  num_samples = 3
)
The input metadata is a list of lists.
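Because each condition carries its own metadata list, it can be handy to flatten the inputs into a table for review before sending a query. The following base R sketch assumes each input has `metadata` and `num_samples` elements as in the example above; the helper `summarise_inputs()` and the example values are hypothetical.

```r
# Example inputs in the list-of-lists shape (values are arbitrary)
inputs <- list(
  list(metadata = list(sex = "male", sample_type = "primary tissue"),
       num_samples = 3),
  list(metadata = list(sex = "female", tissue_ontology_id = "UBERON:0002107"),
       num_samples = 2)
)

# Hypothetical helper: one row per condition, one column per metadata key.
# Keys missing from a condition are filled with NA.
summarise_inputs <- function(inputs) {
  keys <- unique(unlist(lapply(inputs, function(x) names(x$metadata))))
  rows <- lapply(inputs, function(x) {
    vals <- lapply(keys, function(k) {
      v <- x$metadata[[k]]
      if (is.null(v)) NA_character_ else as.character(v)
    })
    names(vals) <- keys
    data.frame(vals, num_samples = x$num_samples, stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

summarise_inputs(inputs)
```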
Here are the available metadata fields:
Biological:
- age_years
- cell_line_ontology_id
- cell_type_ontology_id
- developmental_stage
- disease_ontology_id
- ethnicity
- genotype
- race
- sample_type (“cell line”, “organoid”, “other”, “primary cells”, “primary tissue”, “xenograft”)
- sex (“male”, “female”)
- tissue_ontology_id

Perturbational:
- perturbation_dose
- perturbation_ontology_id
- perturbation_time
- perturbation_type (“coculture”, “compound”, “control”, “crispr”, “genetic”, “infection”, “other”, “overexpression”, “peptide or biologic”, “shrna”, “sirna”)

Technical:
- study (Bioproject ID)
- library_selection (e.g., “cDNA”, “polyA”, “Oligo-dT”; see https://ena-docs.readthedocs.io/en/latest/submit/reads/webin-cli.html#permitted-values-for-library-selection)
- library_layout (“PAIRED”, “SINGLE”)
- platform (“illumina”)

The following are the valid values or expected formats for selected metadata keys:
Metadata Field | Requirement / Example |
---|---|
cell_line_ontology_id | Requires a Cellosaurus ID. |
cell_type_ontology_id | Requires a CL ID. |
disease_ontology_id | Requires a MONDO ID. |
perturbation_ontology_id | Must be a valid Ensembl gene ID (e.g., ENSG00000156127), ChEBI ID (e.g., CHEBI:16681), ChEMBL ID (e.g., CHEMBL1234567), or NCBI Taxonomy ID (e.g., 9606). |
tissue_ontology_id | Requires a UBERON ID. |
To look up ontology terms, we recommend using the EMBL-EBI Ontology Lookup Service.
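A quick format check can catch malformed IDs before a query is sent. The sketch below is base R only. The regular expressions are assumptions based on the conventional shapes of these identifiers (for example `CL:0000000`, `UBERON:0000000`, `MONDO:0000000`, `CVCL_0000`) and on the examples in the table, not rules published by the API, and the helper name `looks_like_valid_id()` is hypothetical. Passing the check does not mean the term exists or that a model accepts it.

```r
# Assumed identifier shapes (not an official specification)
id_patterns <- c(
  cell_line_ontology_id    = "^CVCL_[A-Z0-9]+$",
  cell_type_ontology_id    = "^CL:[0-9]{7}$",
  disease_ontology_id      = "^MONDO:[0-9]{7}$",
  tissue_ontology_id       = "^UBERON:[0-9]{7}$",
  # Ensembl gene ID, ChEBI ID, ChEMBL ID, or a numeric NCBI Taxonomy ID
  perturbation_ontology_id = "^(ENS[A-Z]*G[0-9]{11}|CHEBI:[0-9]+|CHEMBL[0-9]+|[0-9]+)$"
)

# Hypothetical helper: TRUE if `value` matches the assumed shape for `field`
looks_like_valid_id <- function(field, value) {
  pattern <- id_patterns[[field]]
  grepl(pattern, value)
}

looks_like_valid_id("perturbation_ontology_id", "CHEBI:16681")
looks_like_valid_id("tissue_ontology_id", "liver")
```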
Models have a limited acceptable range of metadata input values. If you provide a value that is not in the acceptable range, the API will return an error.
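Since out-of-range values produce an API error, it can help to check the enumerated fields locally first. This sketch hard-codes the allowed values listed in this document for `sample_type`, `sex`, `perturbation_type`, `library_layout`, and `platform`. The helper name `check_metadata()` is hypothetical, and the API remains the source of truth if its accepted values differ or change.

```r
# Allowed values copied from the field list in this document
allowed_values <- list(
  sample_type = c("cell line", "organoid", "other", "primary cells",
                  "primary tissue", "xenograft"),
  sex = c("male", "female"),
  perturbation_type = c("coculture", "compound", "control", "crispr",
                        "genetic", "infection", "other", "overexpression",
                        "peptide or biologic", "shrna", "sirna"),
  library_layout = c("PAIRED", "SINGLE"),
  platform = c("illumina")
)

# Hypothetical helper: returns a character vector of problems;
# an empty vector means nothing was flagged.
check_metadata <- function(metadata) {
  problems <- character(0)
  for (field in intersect(names(metadata), names(allowed_values))) {
    value <- metadata[[field]]
    if (!isTRUE(value %in% allowed_values[[field]])) {
      problems <- c(problems, sprintf("%s: '%s' is not an allowed value",
                                      field, paste(value, collapse = ", ")))
    }
  }
  problems
}

check_metadata(list(sex = "male", sample_type = "primary tissue"))
check_metadata(list(sex = "unknown", platform = "illumina"))
```

Fields without an enumerated list here (for example `age_years`) are simply skipped by this check.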
Once your query is ready, you can send it to the API to generate gene expression data.
If you want the full API response rather than just the metadata and expression data frames, set raw_response = TRUE.
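Because the default result is a list holding `metadata` and `expression` data frames, downstream code can treat it like any other named list. The sketch below uses a tiny made-up stand-in (the column names, gene names, and numbers are placeholders, not real API output) and assumes one row per sample in both data frames, which this document does not state.

```r
# Made-up stand-in for a result; NOT real API output.
result <- list(
  metadata = data.frame(sex = c("male", "female"),
                        sample_type = "primary tissue"),
  expression = data.frame(GENE_A = c(10, 12), GENE_B = c(0, 3))
)

# Basic shape checks before analysis
stopifnot(is.data.frame(result$metadata), is.data.frame(result$expression))
stopifnot(nrow(result$metadata) == nrow(result$expression))  # assumed alignment

# Combine sample annotations with expression values, row by row
combined <- cbind(result$metadata, result$expression)
dim(combined)
```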