#!/usr/bin/python3
"""Variants of Russ Cox’s bisect algorithm to find a minimal satisfying subsequence.

(Invoke from the command line with or without ``-easy`` and ``-long``
to demo different variants.)

Russ Cox’s post “Hash-Based Bisect Debugging in Compilers and
Runtimes” <https://research.swtch.com/bisect> presents an algorithm
(previously presented elsewhere) for finding a locally minimal
satisfying subsequence Σ of a sequence Π, given a predicate with the
property that if a sequence Φ is satisfying, then every longer
subsequence of the original sequence Π of which Φ is a subsequence is
also satisfying.

“Subsequence” here is “subsequence” in the LCS sense, not
“substring”: “ktb” is a subsequence of “kitab”, for example, but not
a substring.

This is useful for automatic debugging, because “my program crashes
on input Σ” is very often such a predicate, and simple binary search
for a single substring of the input is often insufficient, because
the input needed to crash the program is not always a contiguous
chunk of the input stream.  But the algorithm is not limited to that
application.

This does *not* implement the hash-based bisection method that is the
main focus of Cox’s post.  It also doesn’t implement the “easy/hard”
algorithm.  In fact, before I implemented the “easy” algorithm, I
implemented what I had misremembered the “easy” algorithm to be,
which turns out to be an arguably simpler algorithm that also works.
That’s ``bisect_dumb`` below.

Here’s the output (abridged) on a test from Zeller’s _Why Programs
Fail_, which this algorithm solves with 21 probes instead of Zeller’s
48::

    ...
    Found required item > at 39
    ...
    ☑ '<SELECT>'

Note that, just like Zeller’s ``ddmin`` algorithm, both of these
return a locally minimal sequence, having previously tried each of
the possible individual-item removals.

"""
import re
import sys


def first_satisfying_bisect(p, start, end):
    """Return the first i in [start, end) such that p(i), if any.

    This is in a sense the most basic version of the binary search
    algorithm.

    - p: predicate to satisfy.
    - start: first integer in the candidate range.
    - end: first integer > start not in the candidate range.

    Subject to the restriction that if p(j) and k > j, then p(k) too.
    That is, once p goes from false to true, it never goes back to
    false.  Possibly p(i) is false for all values, in which case this
    algorithm returns end.  But it will never call p(end).

    """
    while start < end:
        # Invariant: the correct return value is in [start, end].
        # Variant: end - start strictly diminishes each iteration.
        mid = start + (end - start) // 2
        if p(mid):
            end = mid    # mid is a valid return value; nothing later can be
        else:
            start = mid + 1  # mid, and everything earlier, is *not* valid

    return start


def bisect_dumb(p, s, log=lambda *args: None):
    """Find a locally minimal satisfying subsequence with a dumb algorithm.

    - p: predicate we are trying to satisfy with a list of elements
      from s.
    - s: a list that satisfies p.

    Returns a list of indices into s.

    This version of the algorithm suffers a slowdown factor of lg M
    to find needed items at indices near M.  So, for example,
    prepending 10000 spaces to the test string below makes it require
    120 probes instead of 21.

    """
    assert p(s)
    end, need = len(s), []     # “need” is Cox’s “forced” set.

    while True:
        need.sort()
        needed = [s[j] for j in need]
        # Identify the index of the last required element in [0, end).
        last = first_satisfying_bisect(lambda i: p(s[:i] + needed), 0, end)
        if last == 0:
            return need  # s[:0] + needed is just needed, and is satisfying.

        end = last - 1         # If s[:7] is needed, we need element 6.
        need.append(end)
        log("Found required item", s[end], "at", end)
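

# A usage sketch of my own (not from Cox’s post): minimize a short
# list of numbers under the invented predicate “contains both 3 and
# 7”.  Note that ``bisect_dumb`` calls its predicate with a list of
# *items* from s, but returns a list of *indices* into s.
def _demo_bisect_dumb():
    """Tiny self-check for ``bisect_dumb`` on an invented example.

    >>> _demo_bisect_dumb()
    [3, 7]
    """
    xs = [1, 3, 4, 6, 7, 9]

    def satisfying(items):
        return 3 in items and 7 in items

    # bisect_dumb returns the indices [1, 4], i.e. the items 3 and 7.
    return [xs[i] for i in bisect_dumb(satisfying, xs)]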
log("Found required item", s[end], "at", end) def bisect_easy(p, s, log=lambda *args: None, targets=None, need=[]): """The actual “easy” algorithm from Cox’s post. I translated this from the first Golang listing, but adapted its interface in the process to be compatible with ``bisect_dumb`` above. It returns a sorted list of indices into s. Cox’s version returns the actual items from s rather than the indices. This requires 29 probes into Zeller’s example rather than the 21 required by ``bisect_dumb``, but it is more efficient for long sequences, requiring only 45 instead of ``bisect_dumb``’s 120 when we prepend 10kB of spaces to the test string (``-long``). """ if targets is None: targets = list(range(len(s))) if not targets or p([s[i] for i in sorted(need)]): return [] if len(targets) == 1: return [targets[0]] m = len(targets)//2 left, right = targets[:m], targets[m:] left_reduced = bisect_easy(p, s, log, left, right + need) log("Found on left", [s[i] for i in left_reduced]) # It’s important to use `left_reduced` here rather than just # `left` if we want to make sure our result is actually # satisfying! right_reduced = bisect_easy(p, s, log, right, left_reduced + need) log("Found on right", [s[i] for i in right_reduced]) return left_reduced + right_reduced def trisect_easy(p, s, log=lambda *args: None, targets=None, need=[]): """A trisecting version of Cox’s “easy” algorithm. I have the intuition that this will be consistently more efficient, but tests so far don’t provide strong enough evidence for me to either confirm or reject that hypothesis. """ if targets is None: targets = list(range(len(s))) if not targets or p([s[i] for i in sorted(need)]): return [] if len(targets) == 1: return [targets[0]] m, n = len(targets)//3, 2*len(targets)//3 left, mid, right = targets[:m], targets[m:n], targets[n:] left_reduced = trisect_easy(p, s, log, left, mid + right + need) log("Found on left", [s[i] for i in left_reduced]) mid_reduced = trisect_easy(p, s, log, mid, left_reduced + right + need) log("Found in middle", [s[i] for i in mid_reduced]) right_reduced = bisect_easy(p, s, log, right, left_reduced + mid_reduced + need) log("Found on right", [s[i] for i in right_reduced]) return left_reduced + mid_reduced + right_reduced def example_predicate(s): """ tag. The ``bisect_dumb`` algorithm minimizes the string successfully in 21 trials, and ``bisect_easy`` in 29, instead of the 48 achieved by Zeller’s ``ddmin``. """ return re.compile(r'(?i)').search(''.join(s)) example_string = '