The R package edlibR provides bindings to the C/C++ library edlib, which computes exact pairwise sequence alignments using edit distance (Levenshtein distance). The functions in edlibR are modeled on the API of the Python package edlib on PyPI.
There are three functions within edlibR: align(), getNiceAlignment(), and nice_print().
The first function provided by edlibR is align(). The function align() computes the pairwise alignment of the input query against the input target:
align(query, target, [mode], [task], [k], [cigarFormat], [additionalEqualities])
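A list is returned with the following fields:
- editDistance: the edit distance between query and target.
- alphabetLength: the number of unique characters in query and target.
- locations: a list of locations, in the format list(c(start, end)). Note: if the start or end positions are NULL, this is encoded as NA to work correctly with R vectors.
- cigar: the CIGAR string describing the alignment path; NULL unless task="path".
- cigarFormat: the value of the argument cigarFormat in the function align(), which is returned here for the function getNiceAlignment(). (Note: the function getNiceAlignment() only accepts cigarFormat="extended".)

library(edlibR)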
algn1 = align("ACTG", "CACTRT", mode="HW", task="path")
print(algn1)
## $editDistance
## [1] 1
##
## $alphabetLength
## [1] 5
##
## $locations
## $locations[[1]]
## [1] 1 3
##
## $locations[[2]]
## [1] 1 4
##
##
## $cigar
## [1] "3=1I"
##
## $cigarFormat
## [1] "extended"
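
The locations printed above can be mapped back onto the target. The following is a minimal sketch in base R, assuming the positions are 0-based and inclusive (an inference from the printed values, not something the output states); the values are copied by hand from the output of algn1, so edlibR itself is not needed to run it. The name matched_regions is only illustrative.

```r
# Values copied from the printed output of algn1 above.
target <- "CACTRT"
locations <- list(c(1, 3), c(1, 4))

# Assumption: positions are 0-based and inclusive, so shift both ends by one
# before handing them to substr(), which is 1-based.
matched_regions <- vapply(
  locations,
  function(loc) substr(target, loc[1] + 1, loc[2] + 1),
  character(1)
)

# Gives the two target regions "ACT" and "ACTR".
print(matched_regions)
```

Under that assumption, both regions are at edit distance 1 from the query "ACTG". If a start position is NA, substr() returns NA for that entry, so such locations would need separate handling.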
algn2 = align("elephant", "telephone")
print(algn2)
## $editDistance
## [1] 3
##
## $alphabetLength
## [1] 8
##
## $locations
## $locations[[1]]
## [1] NA 8
##
##
## $cigar
## NULL
##
## $cigarFormat
## [1] "extended"
algn3 = align("ACTG", "CACTRT", mode="HW", task="path")
print(algn3)
## $editDistance
## [1] 1
##
## $alphabetLength
## [1] 5
##
## $locations
## $locations[[1]]
## [1] 1 3
##
## $locations[[2]]
## [1] 1 4
##
##
## $cigar
## [1] "3=1I"
##
## $cigarFormat
## [1] "extended"
## the previous example with additionalEqualities
algn4 = align("ACTG", "CACTRT", mode="HW", task="path", additionalEqualities=list(c("R", "A"), c("R", "G")))
print(algn4)
## $editDistance
## [1] 0
##
## $alphabetLength
## [1] 5
##
## $locations
## $locations[[1]]
## [1] 1 4
##
##
## $cigar
## [1] "4="
##
## $cigarFormat
## [1] "extended"
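Alignment methods supported by edlibR:
- global (NW): the standard edit distance, i.e. the smallest number of operations needed to transform the first sequence into the second sequence. This method is appropriate when you want to find out how similar the first sequence is to the second sequence.
- prefix (SHW): similar to the global method, except that a gap at the end of the query is not penalized. For example, with query as AACT and target as AACTGGC, the edit distance would be 0, because removing GGC from the end of the second sequence is “free” and does not count into the total edit distance. This method is appropriate when you want to find out how well the first sequence fits at the beginning of the second sequence.
- infix (HW): similar to the global method, except that gaps at the start and end of the query are not penalized. For example, with query as ACT and target as CGACTGAC, the edit distance would be 0, because removing CG from the start and GAC from the end of the second sequence is “free” and does not count into the total edit distance. This method is appropriate when you want to find out how well the first sequence fits at any part of the second sequence. For example, if your second sequence was a long text and your first sequence was a sentence from that text, but slightly scrambled, you could use this method to discover how scrambled it is and where it fits in that text. In bioinformatics, this method is appropriate for aligning a read to a sequence.

The CIGAR string uses the following symbols (with cigarFormat="extended"):
- =: match
- X: mismatch
- I: insertion
- D: deletion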
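
To make the difference between the alignment methods concrete, here is a minimal sketch in base R that computes the edit distance with a plain dynamic-programming table. It is not how edlib works internally and is not part of edlibR; the function name edit_distance_sketch and its equalities argument are hypothetical, and the sketch assumes that each additional equality pair is symmetric.

```r
# Illustrative only: a textbook dynamic-programming edit distance,
# not the algorithm edlib uses and not part of the edlibR API.
edit_distance_sketch <- function(query, target, mode = "NW", equalities = list()) {
  q <- strsplit(query, "")[[1]]
  tg <- strsplit(target, "")[[1]]
  n <- length(q)
  m <- length(tg)

  # Two characters match if identical or listed as an extra equality pair
  # (assumed symmetric in this sketch).
  same <- function(a, b) {
    if (a == b) return(TRUE)
    for (pair in equalities) {
      if ((pair[1] == a && pair[2] == b) || (pair[1] == b && pair[2] == a)) {
        return(TRUE)
      }
    }
    FALSE
  }

  # d[i + 1, j + 1]: cost of aligning the first i query characters
  # with an alignment that ends at target position j.
  d <- matrix(0, nrow = n + 1, ncol = m + 1)
  d[, 1] <- 0:n
  # HW may start anywhere in the target, so its first row stays 0;
  # NW and SHW pay 1 for every skipped leading target character.
  if (mode != "HW") d[1, ] <- 0:m

  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      cost <- if (same(q[i], tg[j])) 0 else 1
      d[i + 1, j + 1] <- min(d[i, j] + cost,      # match or mismatch
                             d[i, j + 1] + 1,     # query character unpaired
                             d[i + 1, j] + 1)     # target character unpaired
    }
  }

  # NW must consume the whole target; SHW and HW may stop at any column.
  if (mode == "NW") d[n + 1, m + 1] else min(d[n + 1, ])
}

edit_distance_sketch("AACT", "AACTGGC", mode = "NW")   # 3
edit_distance_sketch("AACT", "AACTGGC", mode = "SHW")  # 0
edit_distance_sketch("ACT", "CGACTGAC", mode = "HW")   # 0
```

The only differences between the three modes are the initialization of the first row and the cell the result is read from. Passing equalities = list(c("R", "A"), c("R", "G")) with query "ACTG", target "CACTRT", and mode "HW" lowers the result from 1 to 0, matching the additionalEqualities example earlier in this document.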
The function getNiceAlignment() takes the output of align(), and represents this in a visually informative format for human inspection (“NICE format”). This will be an informative string showing the matches, mismatches, insertions, and deletions.
getNiceAlignment(alignResult, query, target, [gapSymbol])
Note: Users must use the argument task="path" within align() to output a CIGAR for getNiceAlignment(); otherwise, there will be no CIGAR from which getNiceAlignment() can reconstruct the alignment in “NICE” format. Users must also use the argument cigarFormat="extended" within align(); otherwise, the CIGAR will be too ambiguous for getNiceAlignment() to correctly reconstruct the alignment in “NICE” format.
library(edlibR)
query = "elephant"
target = "telephone"
result = align(query, target, task = "path")
nice_algn = getNiceAlignment(result, query, target)
print(nice_algn)
## $query_aligned
## [1] "-elephant"
##
## $matched_aligned
## [1] "-|||||.|."
##
## $target_aligned
## [1] "telephone"
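
To illustrate why an extended CIGAR is enough to rebuild this output, here is a minimal sketch in base R that expands a CIGAR string into the three aligned strings. It is not the implementation of getNiceAlignment(); the function name expand_cigar_sketch is hypothetical, the CIGAR "1D5=1X1=1X" is hand-derived from the NICE output above rather than captured from align(), and the sketch assumes the alignment starts at the first character of the target.

```r
# Illustrative only: not the implementation of getNiceAlignment().
expand_cigar_sketch <- function(cigar, query, target, gap = "-") {
  # Split e.g. "1D5=1X1=1X" into one operation symbol per alignment column.
  tokens <- regmatches(cigar, gregexpr("[0-9]+[=XID]", cigar))[[1]]
  counts <- as.integer(sub("[=XID]$", "", tokens))
  ops <- rep(sub("^[0-9]+", "", tokens), times = counts)

  q <- strsplit(query, "")[[1]]
  tg <- strsplit(target, "")[[1]]
  qi <- 0  # number of query characters consumed so far
  ti <- 0  # number of target characters consumed so far
  q_out <- character(length(ops))
  m_out <- character(length(ops))
  t_out <- character(length(ops))

  for (k in seq_along(ops)) {
    op <- ops[k]
    if (op == "D") {
      # Target character with no query counterpart: gap in the query.
      ti <- ti + 1
      q_out[k] <- gap; m_out[k] <- gap; t_out[k] <- tg[ti]
    } else if (op == "I") {
      # Query character with no target counterpart: gap in the target.
      qi <- qi + 1
      q_out[k] <- q[qi]; m_out[k] <- gap; t_out[k] <- gap
    } else {
      # "=" (match) or "X" (mismatch): consume one character from each.
      qi <- qi + 1
      ti <- ti + 1
      q_out[k] <- q[qi]; t_out[k] <- tg[ti]
      m_out[k] <- if (op == "=") "|" else "."
    }
  }

  list(query_aligned = paste(q_out, collapse = ""),
       matched_aligned = paste(m_out, collapse = ""),
       target_aligned = paste(t_out, collapse = ""))
}

res <- expand_cigar_sketch("1D5=1X1=1X", "elephant", "telephone")
print(res)
```

The "|" and "." in the match line come directly from the "=" and "X" symbols. A CIGAR format that writes matches and mismatches with the same symbol would not carry that information, which is consistent with the requirement to use cigarFormat="extended".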
The arguments of getNiceAlignment() are:
- alignResult: the output of the function align(). As mentioned above, align() must use the arguments task="path" and cigarFormat="extended" in order for the CIGAR to be informative enough for getNiceAlignment() to work properly.
- query: the query sequence used to generate alignResult.
- target: the target sequence used to generate alignResult.
- gapSymbol: the symbol used to represent gaps in the alignment between query and target (default="-"). This must be a single character, i.e. a string of length 1 (nchar(gapSymbol) must equal 1).

The function nice_print() simply prints the output of getNiceAlignment() to the console for quickly inspecting the alignment. Users can think of this function as a “pretty-print” function for visualization.
library(edlibR)
## example above from getNiceAlignment()
query = "elephant"
target = "telephone"
result = align(query, target, task = "path")
nice_algn = getNiceAlignment(result, query, target)
nice_print(nice_algn)
## [1] "query: -elephant"
## [1] "matched: -|||||.|."
## [1] "target: telephone"
For more information regarding edlib, please see the publication in Bioinformatics.