There are several “helper” functions which can simplify the definition of complex patterns. First we define some functions that will help us display the patterns:
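# Return the regex string for one pattern: a character string is
# returned as is, and an nc pattern is converted via nc::var_args_list.
one.pattern <- function(pat){
  if(is.character(pat)){
    pat
  }else{
    nc::var_args_list(pat)[["pattern"]]
  }
}
# Print the regex string generated by each of the given patterns.
show.patterns <- function(...){
  L <- list(...)
  str(lapply(L, one.pattern))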
}

nc::field for reducing repetition

The nc::field function can be used to avoid repetition when defining patterns of the form variable: value. The example below shows three (mostly) equivalent ways to write a regex that captures the text after the colon and space; the captured text is stored in the variable group or output column:
show.patterns(
  "variable: (?<variable>.*)", # repetitive regex string
  list("variable: ", variable=".*"), # repetitive nc R code
  nc::field("variable", ": ", ".*")) # helper function avoids repetition
#> List of 3
#> $ : chr "variable: (?<variable>.*)"
#> $ : chr "(?:variable: (.*))"
#> $ : chr "(?:variable: (?:(.*)))"
Note that the first version above has a named capture group, whereas the second and third patterns, generated by nc, have an unnamed capture group and some non-capturing groups. All three nevertheless match the same subject text.
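To check that the three pattern strings printed above capture the same text, the sketch below applies each one to a made-up subject using base R rather than nc. The subject string and the first.capture helper are assumptions introduced only for illustration, and regexec(perl=TRUE) is assumed to be available (R 3.3.0 or later).

```r
# Hypothetical subject string, not taken from the nc documentation.
subject <- "variable: some value"
# The three pattern strings printed by show.patterns above.
patterns <- c(
  "variable: (?<variable>.*)", # named capture group
  "(?:variable: (.*))",        # generated from the nc list
  "(?:variable: (?:(.*)))")    # generated by nc::field
# Hypothetical helper: return the text matched by the first capture group.
first.capture <- function(pat, subject){
  # regexec returns the whole match followed by each capture group.
  match.list <- regmatches(subject, regexec(pat, subject, perl=TRUE))
  match.list[[1]][[2]]
}
captured <- vapply(
  patterns, first.capture, character(1),
  subject=subject, USE.NAMES=FALSE)
print(captured)
```

Each of the three patterns should capture "some value", the text after the colon and space.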
Another example, in which the field name and its value are separated by a space:
show.patterns(
  "Alignment (?<Alignment>[0-9]+)",
  list("Alignment ", Alignment="[0-9]+"),
  nc::field("Alignment", " ", "[0-9]+"))
#> List of 3
#> $ : chr "Alignment (?<Alignment>[0-9]+)"
#> $ : chr "(?:Alignment ([0-9]+))"
#> $ : chr "(?:Alignment (?:([0-9]+)))"
A third example, in which the separator is a colon followed by one or more tabs:
show.patterns(
  "Chromosome:\t+(?<Chromosome>.*)",
  list("Chromosome:\t+", Chromosome=".*"),
  nc::field("Chromosome", ":\t+", ".*"))
#> List of 3
#> $ : chr "Chromosome:\t+(?<Chromosome>.*)"
#> $ : chr "(?:Chromosome:\t+(.*))"
#> $ : chr "(?:Chromosome:\t+(?:(.*)))"

nc::quantifier for fewer parentheses

Another helper function is nc::quantifier, which makes patterns easier to read by reducing the number of parentheses required to define sub-patterns with quantifiers. For example, all three patterns below create an optional non-capturing group which contains a named capture group:
show.patterns(
  "(?:-(?<chromEnd>[0-9]+))?", # regex string
  list(list("-", chromEnd="[0-9]+"), "?"), # nc pattern using lists
  nc::quantifier("-", chromEnd="[0-9]+", "?")) # quantifier helper function
#> List of 3
#> $ : chr "(?:-(?<chromEnd>[0-9]+))?"
#> $ : chr "(?:(?:-([0-9]+))?)"
#> $ : chr "(?:(?:-([0-9]+))?)"
Another example with a named capture group inside an optional non-capturing group:
show.patterns(
  "(?: (?<name>[^,}]+))?",
  list(list(" ", name="[^,}]+"), "?"),
  nc::quantifier(" ", name="[^,}]+", "?"))
#> List of 3
#> $ : chr "(?: (?<name>[^,}]+))?"
#> $ : chr "(?:(?: ([^,}]+))?)"
#> $ : chr "(?:(?: ([^,}]+))?)"

nc::alternatives for simplified alternation

We also provide a helper function, nc::alternatives, for defining regex patterns with alternation. The following three patterns are equivalent:
show.patterns(
  "(?:(?<first>bar+)|(?<second>fo+))",
  list(first="bar+", "|", second="fo+"),
  nc::alternatives(first="bar+", second="fo+"))
#> List of 3
#> $ : chr "(?:(?<first>bar+)|(?<second>fo+))"
#> $ : chr "(?:(bar+)|(fo+))"
#> $ : chr "(?:(bar+)|(fo+))"
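As a rough check of how the alternation behaves, the sketch below applies the pattern string printed for nc::alternatives to two made-up subjects using base R rather than nc. The subject.vec values and the column names given to the result are assumptions introduced only for illustration, and regexec(perl=TRUE) is assumed to be available (R 3.3.0 or later).

```r
# Hypothetical subjects: one for each branch of the alternation.
subject.vec <- c("barrr", "fooo")
# Pattern string printed above for nc::alternatives.
pattern <- "(?:(bar+)|(fo+))"
# regexec returns the whole match followed by each capture group.
match.list <- regmatches(
  subject.vec, regexec(pattern, subject.vec, perl=TRUE))
# One row per subject; columns are the whole match and the two groups.
match.mat <- do.call(rbind, match.list)
colnames(match.mat) <- c("match", "first", "second")
print(match.mat)
```

Only one branch of an alternation can match a given subject, so the group belonging to the other branch should come back as an empty string: "barrr" fills the first column and "fooo" fills the second.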