Module paroxython.map_taxonomy
Translate labels into taxa.
Description
This step comes after paroxython.label_programs has tagged a program with the labels specified in spec.md. Its purpose is to convert these labels into taxa, based on the rules defined (by default) in taxonomy.tsv.
Example
Let's return to the Jupyter notebook cell given in the introduction:
1 %%paroxython labels
2 def fibonacci(n):
3     result = []
4     (a, b) = (0, 1)
5     while a < n:
6         result.append(a)
7         (a, b) = (b, a + b)
8     return result
Note that the cell magic command %%paroxython is now called with a labels argument. This produces the following output (limited for brevity to its first four rows):
Label | Lines
---|---
addition_operator | 7
assignment | 3, 4, 7
assignment_lhs_identifier:a | 4, 7
assignment_lhs_identifier:b | 4, 7
These labels are low-level tags, intended for internal use only. They still need to be translated into taxa, which are the only tags of interest to the end user. For instance, "addition_operator" will be translated into "operator/arithmetic/addition" (note that, unlike a label, a taxon may feature one or several slashes, which indicate the nesting of notions). More examples are given in the dedicated section of the user manual.
In fact, both labels and taxa are couples:
- Label: ("addition_operator", [(7, 7, "...")]).
- Taxon: ("operator/arithmetic/addition", {(7, 7, "..."): 1}).
The second member is slightly more complicated than a list of line numbers:
- For a label, it is a list of triples consisting of the start and end of the spanning lines, plus a third member, the path (here denoted by the ellipsis), which unambiguously identifies the start of the span.
- For a taxon, it is a bag, i.e. a dictionary associating such triples with the count of their occurrences in the program.
Since the paths are left untouched by the conversion, we will omit them from now on.
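For illustration, here is a minimal plain-Python sketch of these two shapes, using the "assignment" spans from the table above and collections.Counter as the bag (the taxon name is the one this label will be mapped to further down):

from collections import Counter

# A label couple: a name and a list of (start, end) spans (paths omitted).
label = ("assignment", [(3, 3), (4, 4), (7, 7)])

# A taxon couple: a name and a bag of spans, i.e. a mapping span -> occurrence count.
taxon = ("var/assignment/explicit", Counter({(3, 3): 1, (4, 4): 1, (7, 7): 1}))

assert taxon[1][(7, 7)] == 1  # one occurrence spanning lines 7 to 7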
Deduplication
- Labels:
  - ("assignment", [(3, 3), (4, 4), (7, 7)])
  - ("single_assignment:result", [(3, 3)])
  - ("parallel_assignment", [(4, 4), (7, 7)])
  - ("slide", [(7, 7)])
- Taxa:
  - ("var/assignment/explicit/single", {(3, 3): 1})
  - ("var/assignment/explicit/parallel", {(4, 4): 1})
  - ("var/assignment/explicit/parallel/slide", {(7, 7): 1})
The example program features three distinct assignments. They can be categorized into:
- A single assignment on line 3 (result = []).
- A parallel assignment on line 4 ((a, b) = (0, 1)).
- A parallel assignment where a couple is updated by “sliding” the value of its second member to the first position, on line 7 ((a, b) = (b, a + b)).
As you can see, labelling all these produces a lot of span redundancies. For instance, the assignment of line 7 is labelled three times, as "assignment", "parallel_assignment" and "slide". Since the labels are not structured, this is the best that we can do.
Enter the taxa, whose production is triggered by the following taxonomy rows:
Taxa (replacement patterns) | Labels (search patterns)
---|---
var/assignment/explicit | assignment\b.*
var/assignment/explicit/parallel | parallel_assignment:\d+
var/assignment/explicit/parallel/slide | slide
var/assignment/explicit/single | single_assignment:.+
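To make the mechanism concrete, here is a minimal standalone sketch of how the first and last rows above rewrite two of the labels listed earlier. It uses Python's built-in re module for self-containment, whereas the actual implementation relies on the third-party regex module:

import re

# Two rows of the taxonomy above: (taxon replacement pattern, label search pattern).
rows = [
    (r"var/assignment/explicit", r"assignment\b.*"),
    (r"var/assignment/explicit/single", r"single_assignment:.+"),
]

for label_name in ("assignment", "single_assignment:result"):
    for (taxon_pattern, label_pattern) in rows:
        if re.match(label_pattern + "$", label_name):
            print(label_name, "->", re.sub(label_pattern + "$", taxon_pattern, label_name))

# assignment -> var/assignment/explicit
# single_assignment:result -> var/assignment/explicit/single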
Initially, the conversion produces as many redundant spans as for the labels:
- Taxa:
  - ("var/assignment/explicit", {(3, 3): 1, (4, 4): 1, (7, 7): 1})
  - ("var/assignment/explicit/single", {(3, 3): 1})
  - ("var/assignment/explicit/parallel", {(4, 4): 1, (7, 7): 1})
  - ("var/assignment/explicit/parallel/slide", {(7, 7): 1})
However, because of the tree-like structuring of taxa, it is possible to remove those occurrences that are subject to further characterization (e.g., var/assignment/explicit/parallel on line 7 is further characterized as var/assignment/explicit/parallel/slide), resulting in the list of taxa given at the start of the section.
Note that the removal of spans has finally led to the elimination of the overly generic taxon var/assignment/explicit.
Source code
from collections import Counter, defaultdict
from pathlib import Path
from typing import Callable, Dict, List, Optional, Tuple
from typing import Pattern as RegexPattern
from os.path import commonpath, dirname
from functools import lru_cache

import regex  # type: ignore

from .user_types import (
    LabelName,
    LabelPattern,
    Labels,
    Taxon,
    TaxonName,
    TaxonNames,
    TaxonPattern,
    Taxa,
    TaxaSpans,
)

__pdoc__ = {
    "Taxonomy.__init__": True,
    "Taxonomy": "",
}


class Taxonomy:
    def __init__(
        self,
        taxonomy_path: Optional[Path] = None,
        **kwargs,
    ) -> None:
        taxonomy_path = taxonomy_path or Path(dirname(__file__)) / "resources" / "taxonomy.tsv"
        tsv = taxonomy_path.read_text().partition("-- EOF")[0].strip()
        self.literal_labels: Dict[LabelName, TaxonNames] = defaultdict(list)
        self.compiled_labels: List[Tuple[RegexPattern, TaxonPattern]] = []
        for line in sorted(tsv.split("\n")[1:]):
            (taxon_value, label_value, *_) = line.strip().split(maxsplit=2)
            (taxon_pattern, label_pattern) = (TaxonPattern(taxon_value), LabelPattern(label_value))
            if is_literal(label_pattern):
                self.literal_labels[LabelName(label_pattern)].append(TaxonName(taxon_pattern))
            else:
                self.compiled_labels.append((regex.compile(f"{label_pattern}$"), taxon_pattern))
                # note: "$" is necessary: regex.fullmatch() has no regex.fullsub() counterpart

    @lru_cache(maxsize=None)
    def get_taxon_name_list(
        self, label_name: LabelName, looks_like_a_taxon: Callable = regex.compile(r"^\w+/.+$").match
    ) -> TaxonNames:
        if looks_like_a_taxon(label_name):
            return [TaxonName(label_name)]
        result: TaxonNames = self.literal_labels.get(label_name, [])
        for (label_regex_pattern, taxon_pattern) in self.compiled_labels:
            if label_regex_pattern.match(label_name):
                result.append(TaxonName(label_regex_pattern.sub(taxon_pattern, label_name)))
        return result

    def to_taxa(self, labels: Labels) -> Taxa:
        acc: TaxaSpans = defaultdict(Counter)
        for (label_name, spans) in labels:
            for taxon_name in self.get_taxon_name_list(label_name):
                acc[taxon_name].update(spans)
        taxa = [Taxon(name, spans) for (name, spans) in sorted(acc.items())]
        return deduplicated_taxa(taxa)


def is_literal(label_pattern: LabelPattern) -> bool:
    return label_pattern.replace(".", "\\.") == regex.escape(label_pattern)


def deduplicated_taxa(taxa: Taxa) -> Taxa:
    if len(taxa) < 2:
        return taxa
    for (i, (name, spans)) in enumerate(taxa[1:], 1):
        for (previous_name, previous_spans) in reversed(taxa[:i]):
            # Since `"/"` happens to be a separator in both taxa and Unix-paths, `os.commonpath()`
            # comes in handy for extracting the common prefix of two taxa:
            common_prefix = commonpath((name, previous_name))
            if not common_prefix:
                break
            if previous_name == common_prefix:
                difference = spans - previous_spans
                previous_spans.subtract(spans)
                spans = difference
    result = []
    for (name, spans) in taxa:
        spans += Counter()  # suppress items whose count <= 0
        if spans:  # if any item remains in the bag
            result.append(Taxon(name, spans))
    return result


if __name__ == "__main__":
    from .goodies import couple_to_string
    from .label_programs import labelled_programs

    taxonomy = Taxonomy()
    for program in labelled_programs(Path("examples/simple/programs")):
        taxa = taxonomy.to_taxa(program.labels)
        if not taxa:
            continue
        width = min(40, max(len(" ".join(map(str, taxon.spans))) for taxon in taxa))
        for (name, spans) in taxa:
            span_string = " ".join(map(couple_to_string, sorted(set(spans))))
            print(f"{span_string:>{width}}\t{name}")
Functions
def is_literal(label_pattern: LabelPattern) -> bool
Tell whether a given regular expression pattern is literal, disregarding the dots.
Description
All label patterns of the taxonomy TSV file (second column) will be tested successively. For instance, the label pattern "external_free_call:print" is literal, and therefore matches only itself. Conversely, the label pattern "internal_free_call:[[:upper:]].*" contains some non-literal characters such as "[" and "*", and therefore must be compiled into a regular expression. A pattern is considered to be literal if it is equal to its escaped form, not taking the dots into account.
Note
If not treating the dot as a metacharacter is a problem, it is always possible to force the interpretation of a pattern as non-literal by enclosing it in parentheses.
Source code
def is_literal(label_pattern: LabelPattern) -> bool:
    return label_pattern.replace(".", "\\.") == regex.escape(label_pattern)
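For instance, here is a standalone sketch of the same check on the two patterns quoted above; it substitutes re.escape (Python ≥ 3.7) for regex.escape, which behaves the same way on these inputs:

import re

def looks_literal(label_pattern: str) -> bool:
    # Literal means: escaping changes nothing except (possibly) the dots.
    return label_pattern.replace(".", "\\.") == re.escape(label_pattern)

print(looks_literal("external_free_call:print"))          # True: matches only itself
print(looks_literal("internal_free_call:[[:upper:]].*"))  # False: must be compiled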
def deduplicated_taxa(taxa: List[Taxon]) -> List[Taxon]
Remove from the given taxa those that are counted multiple times as prefixes of other ones.
Description
Consider the following program:
1 while n != 1:
2     if n % 2 == 0:
3         n = n // 2
4     else:
5         n = 3 * n + 1
It features exactly one literal integer on lines 1 and 3, and exactly two on lines 2 and 5. Moreover, one of the literals of line 2 is zero. Therefore, the program is originally tagged with the following two couples (among others):
[ ... ("type/number/integer/literal", { (1, 1): 1, (2, 2): 2, (3, 3): 1, (5, 5): 2, } ), ("type/number/integer/literal/zero", {(2, 2): 1}), ... ]
Tip
In the dictionary above, (5, 5): 2 means that the taxon "type/number/integer/literal" has 2 distinct occurrences spanning the lines 5 to 5 (namely, 3 and 1). Such a “counter” dictionary is called a bag, or multiset.
Now, every occurrence of a given taxon is an occurrence of any of its prefixes. For instance, an occurrence of the literal integer 0 is obviously an occurrence of a literal integer. Currently, it is counted twice. The purpose of this function is to remove such redundant counts. The longest taxon will be kept, and its prefix count decreased. Finally, the following deduplicated list will be returned:
[
    ...
    ("type/number/integer/literal", {
        (1, 1): 1,
        (2, 2): 1,  # decremented
        (3, 3): 1,
        (5, 5): 2,
    }),
    ("type/number/integer/literal/zero", {(2, 2): 1}),
    ...
]
Decrementing a span count may result in the deletion of a taxon. For instance, on the same program:
[ ... ("flow/loop", {(1, 5): 1}, ("flow/loop/exit/late", {(1, 5): 1}, ("flow/loop/while", {(1, 5): 1}, ... ]
… becomes after deduplication:
[ ... ("flow/loop/exit/late", {(1, 5): 1}, ("flow/loop/while", {(1, 5): 1}, ... ]
Args
taxa : Taxa
- A list of couples, sorted lexicographically, each consisting of a taxon name and a bag (multiset) of its spans.
Returns
Taxa
- The same list, deduplicated.
Source code
def deduplicated_taxa(taxa: Taxa) -> Taxa:
    if len(taxa) < 2:
        return taxa
    for (i, (name, spans)) in enumerate(taxa[1:], 1):
        for (previous_name, previous_spans) in reversed(taxa[:i]):
            # Since `"/"` happens to be a separator in both taxa and Unix-paths, `os.commonpath()`
            # comes in handy for extracting the common prefix of two taxa:
            common_prefix = commonpath((name, previous_name))
            if not common_prefix:
                break
            if previous_name == common_prefix:
                difference = spans - previous_spans
                previous_spans.subtract(spans)
                spans = difference
    result = []
    for (name, spans) in taxa:
        spans += Counter()  # suppress items whose count <= 0
        if spans:  # if any item remains in the bag
            result.append(Taxon(name, spans))
    return result
Classes
class Taxonomy
Source code
class Taxonomy:
    def __init__(
        self,
        taxonomy_path: Optional[Path] = None,
        **kwargs,
    ) -> None:
        taxonomy_path = taxonomy_path or Path(dirname(__file__)) / "resources" / "taxonomy.tsv"
        tsv = taxonomy_path.read_text().partition("-- EOF")[0].strip()
        self.literal_labels: Dict[LabelName, TaxonNames] = defaultdict(list)
        self.compiled_labels: List[Tuple[RegexPattern, TaxonPattern]] = []
        for line in sorted(tsv.split("\n")[1:]):
            (taxon_value, label_value, *_) = line.strip().split(maxsplit=2)
            (taxon_pattern, label_pattern) = (TaxonPattern(taxon_value), LabelPattern(label_value))
            if is_literal(label_pattern):
                self.literal_labels[LabelName(label_pattern)].append(TaxonName(taxon_pattern))
            else:
                self.compiled_labels.append((regex.compile(f"{label_pattern}$"), taxon_pattern))
                # note: "$" is necessary: regex.fullmatch() has no regex.fullsub() counterpart

    @lru_cache(maxsize=None)
    def get_taxon_name_list(
        self, label_name: LabelName, looks_like_a_taxon: Callable = regex.compile(r"^\w+/.+$").match
    ) -> TaxonNames:
        if looks_like_a_taxon(label_name):
            return [TaxonName(label_name)]
        result: TaxonNames = self.literal_labels.get(label_name, [])
        for (label_regex_pattern, taxon_pattern) in self.compiled_labels:
            if label_regex_pattern.match(label_name):
                result.append(TaxonName(label_regex_pattern.sub(taxon_pattern, label_name)))
        return result

    def to_taxa(self, labels: Labels) -> Taxa:
        acc: TaxaSpans = defaultdict(Counter)
        for (label_name, spans) in labels:
            for taxon_name in self.get_taxon_name_list(label_name):
                acc[taxon_name].update(spans)
        taxa = [Taxon(name, spans) for (name, spans) in sorted(acc.items())]
        return deduplicated_taxa(taxa)
Methods
def __init__(self, taxonomy_path: Union[pathlib.Path, NoneType] = None, **kwargs) -> NoneType
Read and pre-process the taxonomy specifications.
Args
taxonomy_path : Optional[Path], optional
- The path of a two-column TSV file associating label (search) patterns with taxon (replacement) patterns. For better readability, the taxa are listed in the first column, and the corresponding labels (sometimes very long) in the second column. If not specified, the default taxonomy is used. Defaults to None.
Description
- Read the taxonomy, falling back to the provided default taxonomy.tsv.
- Drop the part starting with the string "-- EOF", if any (this pre-processing allows you to leave notes or drafts at the end of the file).
- Distribute the remaining rows into two different accumulators, depending on the “literality” of the label pattern (a minimal sketch of this dispatch is given after the list):
  - self.literal_labels is a dictionary associating every literal label pattern, that is to say, a label name, with a list of taxon names. Most of these lists are singletons, but one can imagine that a unique label name could be associated with multiple taxon names. As a dictionary, it allows O(1) lookups of any given label pattern.
  - self.compiled_labels is a list of couples, of which the first member is a compiled label pattern, and the second member is a replacement pattern which may contain backreferences to one or several capture groups of the label pattern. For instance, it can include the couple (regex.compile(r"index_shape:(\d+)$"), r"subscript/index/shape/\1"), which will be used to “translate” any occurrence of the label name "index_shape:3" into the taxon name "subscript/index/shape/3". As an associative list, it only allows O(n) lookups of a given label pattern.
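Here is that sketch: a standalone illustration of the dispatch, not the class's own code, using the built-in re module instead of regex, the index_shape row quoted above, and the free_call:list row quoted further down the page:

import re
from collections import defaultdict

rows = [  # (taxon replacement pattern, label search pattern)
    (r"subscript/index/shape/\1", r"index_shape:(\d+)"),
    (r"type/sequence/list", r"free_call:list"),
]

literal_labels = defaultdict(list)
compiled_labels = []
for (taxon_pattern, label_pattern) in rows:
    if label_pattern.replace(".", "\\.") == re.escape(label_pattern):  # literal?
        literal_labels[label_pattern].append(taxon_pattern)
    else:
        compiled_labels.append((re.compile(label_pattern + "$"), taxon_pattern))

print(dict(literal_labels))  # {'free_call:list': ['type/sequence/list']}
print(compiled_labels)       # [(re.compile('index_shape:(\\d+)$'), 'subscript/index/shape/\\1')]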
Source code
def __init__(
    self,
    taxonomy_path: Optional[Path] = None,
    **kwargs,
) -> None:
    taxonomy_path = taxonomy_path or Path(dirname(__file__)) / "resources" / "taxonomy.tsv"
    tsv = taxonomy_path.read_text().partition("-- EOF")[0].strip()
    self.literal_labels: Dict[LabelName, TaxonNames] = defaultdict(list)
    self.compiled_labels: List[Tuple[RegexPattern, TaxonPattern]] = []
    for line in sorted(tsv.split("\n")[1:]):
        (taxon_value, label_value, *_) = line.strip().split(maxsplit=2)
        (taxon_pattern, label_pattern) = (TaxonPattern(taxon_value), LabelPattern(label_value))
        if is_literal(label_pattern):
            self.literal_labels[LabelName(label_pattern)].append(TaxonName(taxon_pattern))
        else:
            self.compiled_labels.append((regex.compile(f"{label_pattern}$"), taxon_pattern))
def get_taxon_name_list(self, label_name: LabelName, looks_like_a_taxon: Callable = <built-in method match of regex>) -> List[TaxonName]
Translate a label name into a list of taxon names.
Description
First of all, when a label looks like a taxon, it is returned without further ado.
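This test is the default value of the looks_like_a_taxon parameter, a compiled pattern requiring at least one slash after a leading word. A quick standalone check (using the built-in re module with the same pattern as the source):

import re

looks_like_a_taxon = re.compile(r"^\w+/.+$").match

print(bool(looks_like_a_taxon("operator/arithmetic/addition")))  # True: returned as-is
print(bool(looks_like_a_taxon("addition_operator")))             # False: must be translated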
For the rest, most of the work was done during the initialization, by constructing the map self.literal_labels and the list self.compiled_labels. The given label name is first looked up in the map, then matched successively against every regular expression stored in the list.
The second step may be useful even if the first one has already produced one or several translations. For instance, take a taxonomy defining the following associations:
Taxa (replacement patterns) | Labels (search patterns)
---|---
call/subroutine/builtin/casting/\1 | free_call:(list|dict)
type/sequence/list | free_call:list
Suppose now that we pass the label "free_call:list". The lookup in the map produces a first translation: "type/sequence/list". However, the linear search in the list produces a second translation, namely "call/subroutine/builtin/casting/list".
>>> a_taxonomy_instance.get_taxon_name_list("free_call:list")
[
    "type/sequence/list",
    "call/subroutine/builtin/casting/list",
]
Therefore, the complexity of this function is linear in every case. However, its memoization avoids having to translate the same label more than once (cf. lru_cache).
Source code
@lru_cache(maxsize=None)
def get_taxon_name_list(
    self, label_name: LabelName, looks_like_a_taxon: Callable = regex.compile(r"^\w+/.+$").match
) -> TaxonNames:
    if looks_like_a_taxon(label_name):
        return [TaxonName(label_name)]
    result: TaxonNames = self.literal_labels.get(label_name, [])
    for (label_regex_pattern, taxon_pattern) in self.compiled_labels:
        if label_regex_pattern.match(label_name):
            result.append(TaxonName(label_regex_pattern.sub(taxon_pattern, label_name)))
    return result
def to_taxa(self, labels: List[Label]) -> List[Taxon]
Translate a list of labels to a list of taxa.
Args
labels : Labels
- A list of labels, each label consisting of a name and a list of spans.
Returns
Taxa
- A sorted list of taxa, each taxon consisting of a name and a bag of spans.
Description
This simple method is the sole entry point of the object. It takes a list of labels, accumulates their translations into taxa (cf. Taxonomy.get_taxon_name_list()) and returns them deduplicated (cf. deduplicated_taxa()).
Source code
def to_taxa(self, labels: Labels) -> Taxa:
    acc: TaxaSpans = defaultdict(Counter)
    for (label_name, spans) in labels:
        for taxon_name in self.get_taxon_name_list(label_name):
            acc[taxon_name].update(spans)
    taxa = [Taxon(name, spans) for (name, spans) in sorted(acc.items())]
    return deduplicated_taxa(taxa)
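As a usage sketch mirroring the module's own __main__ block (the example directory is the one used there and may not exist in every checkout):

from pathlib import Path

from paroxython.label_programs import labelled_programs
from paroxython.map_taxonomy import Taxonomy

taxonomy = Taxonomy()  # default taxonomy.tsv
for program in labelled_programs(Path("examples/simple/programs")):
    for (name, spans) in taxonomy.to_taxa(program.labels):
        print(name, sorted(spans))  # a taxon name followed by its sorted spans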