#!/usr/bin/python3
# -*- coding: utf-8 -*-
"""Simple reverse-mode automatic differentiation in Python.

There are a number of problems for which it’s relatively easy to
write a program to determine whether a given configuration is a
solution, but relatively hard to write a program to find one.  For
example:

- find a complex number z for which -6 = 3z⁶ - z⁵ + 4z⁴ + 3z³ + 6z² +
  9z (there’s one close to -0.134714309 + 1.1861122304i, for example,
  and another near -0.697095529 + 0.38106567i, and each of those has
  a conjugate);

- find a non-overlapping spatial arrangement of letters from only a
  few fonts that closely approximates a given page image (predicting,
  say, 99% of its pixels);

- find the solution x to a set of linear equations Ax = b, where A is
  a matrix and x and b are vectors;

- find the distribution of stresses in a given piece of elastic
  material given its structure and material properties, and some
  static loads and boundary conditions;

- find a way to lay out a given JSON data structure within 15 lines
  and 80 columns.

The problem of finding such a solution, given a program to recognize
it, is known as an “inverse problem”, which is an unfortunately vague
name.  There’s a closely related class of problems known,
confusingly, as “programming”, “mathematical programming”,
“optimization”, or “mathematical optimization”, in which the
objective is to find the minimum of some function.  I'm going to call
them “minimization problems”.  The above inverse-problem examples can
in some sense be interchanged with the following minimization
problems:

- find a complex number z which minimizes |3z⁶ - z⁵ + 4z⁴ + 3z³ + 6z²
  + 9z + 6|;

- find the non-overlapping spatial arrangement of letters from only a
  few fonts that minimizes the approximation error to a given page
  image;

- find a vector x̂ that minimizes |Ax̂ - b|;

- find a possible distribution of stresses in a given piece of
  elastic material which, given its structure, material properties,
  loads, and boundary conditions, minimizes the accelerations it
  would be experiencing;

- minimize the number of lines needed to lay out a given JSON data
  structure within 80 columns.

A note about terminology: the thing you’re trying to find the minimum
of is called the “objective function” or “cost function”; the things
you can change to try to lower it are called the “design variables”.

These problems are equivalent in the following sense.  If you can
solve the minimization problem correctly, the solution will be a
solution to the corresponding inverse problem, if one exists; and,
conversely, if you can solve the inverse problem, and it has some
kind of error-tolerance parameter, you can solve it repeatedly with
gradually tightening error bounds to approximate the solution to the
minimization problem to a desired precision.

Reframing an inverse problem as a minimization problem often makes it
easier to solve, because there are a lot of algorithms available for
solving minimization problems.  As I understand it, the algorithms
that work efficiently for high-dimensionality problems (in practice)
require you to be able not only to evaluate the objective function at
different points, but also to evaluate its partial derivatives with
respect to the design variables — that is to say, its gradient.  It
turns out that essentially any function you can compute also has a
computable derivative, at the cost of only a constant factor more
computation, to the same precision, using a procedure known as
“automatic differentiation”.
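
(To make the use of the gradient concrete: the gradient-descent
optimizers later in this file repeatedly apply the update

    x ← x − η · ∂f/∂x

to each design variable x, where f is the objective and η is a
learning rate.  Everything here about derivatives exists to make
those ∂f/∂x values cheap to obtain for an arbitrary dataflow graph.)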

Automatic differentiation can be done in “forward mode”, which is
very simple to implement: you simply accompany each intermediate
quantity in the computation with its gradient with respect to the
independent variables.  But this is slow in the case where there are
many independent variables; if you have N independent variables and
compute M intermediate quantities, you have to do N·M computations, N
times more than merely computing the function pointwise.  In the case
where you only care about the gradient of one or a few dependent
variables — such as the objective function in a minimization problem
— a much faster alternative is available, called “reverse-mode
automatic differentiation”.  You can propagate the gradient
information *backwards* through the dataflow graph, computing the
loading (the sensitivity) of the dependent variables on each
intermediate quantity, in the reverse of the order in which you
computed the quantities in the first place.  This can be many orders
of magnitude faster than forward mode for minimization problems.

In short, reverse-mode automatic differentiation is a crucial
enabling technology for computing gradients cheaply; cheap gradients
are a crucial enabling technology for solving minimization problems;
and solving minimization problems is a crucial enabling technology
for solving inverse problems.

I’m writing this program in order to understand how to implement
reverse-mode automatic differentiation.

What’s implemented so far:

* Operator overloading to construct a dataflow graph
* Optimization of it using random search
* First-order forward-mode automatic differentiation of it
* Optimization of it using gradient descent

Not yet:

- Gradient descent optimization at a reasonable speed
- Gradient descent optimization that works reliably?
- Reverse-mode automatic differentiation

"""
from __future__ import division, print_function

import collections
import random

import numpy


# Dataflow graph construction.

class Vars:
    "Syntactic sugar to define variables to start the dataflow graph with."
    def __init__(self):
        self.assigned = set()

    def __getattr__(self, name):
        if name not in self.assigned:
            self.assigned.add(name)
            return Var(name)
        i = 0
        while True:
            n = '%s%d' % (name, i)
            i += 1
            if n not in self.assigned:
                self.assigned.add(n)
                return Var(n)


some = Vars()


class Node:
    "A node in the dataflow graph; abstract base class."
    is_input = False

    def inputs(self):
        "Return the Vars that affect this node, transitively."
        return self.tsort().inputs()

    def tsort(self, program=None):
        "Topologically sort the dataflow graph nodes into a program."
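        # This is a depth-first postorder traversal: a node's parents
        # are appended to the Program before the node itself, so each
        # step appears after every step it depends on, and the
        # membership test below keeps a shared subexpression from
        # being emitted more than once.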
        if program is None:
            program = Program()
        for parent in self.parents():
            if parent not in program:
                parent.tsort(program)
        program.add(self)
        return program

    def eval(self, env):
        return self.tsort().eval(env)

    def graphviz(self):
        return self.tsort().graphviz()

    def repr_step(self, program):
        return '%s(%s)' % (self.__class__.__name__,
                           ', '.join(str(program.index(arg))
                                     for arg in self.parents()))

    def __add__(self, other):
        return Add(self, other)

    def __radd__(self, other):
        return Add(other, self)

    def __sub__(self, other):
        return Sub(self, other)

    def __rsub__(self, other):
        return Sub(other, self)

    def __neg__(self):
        return -1 * self

    def __mul__(self, other):
        return Mul(self, other)

    def __rmul__(self, other):
        return Mul(other, self)

    def __truediv__(self, other):
        return Div(self, other)

    def __rtruediv__(self, other):
        return Div(other, self)

    def __pow__(self, exp):
        return Pow(self, exp)

    def __rpow__(self, base):
        return Pow(base, self)

    def __abs__(self):
        return Abs(self)


class Var(Node):
    "An input to the dataflow graph; an independent variable."
    is_input = True

    def __init__(self, name):
        self.name = name

    def parents(self):
        return ()

    def __repr__(self):
        return 'Var(%r)' % (self.name,)

    def repr_step(self, program):
        return repr(self)

    def apply(self, args, env):
        return env[self]

    def render_graphviz(self, canvas):
        canvas.node(self, self.name, shape='diamond')


class Add(Node):
    "An addition dataflow graph node."
    def __init__(self, a, b):
        self.a = as_graph(a)
        self.b = as_graph(b)

    def parents(self):
        return self.a, self.b

    def __repr__(self):
        return '(%r) + (%r)' % (self.a, self.b)

    def apply(self, args, env):
        a, b = args
        return a + b

    def render_graphviz(self, canvas):
        canvas.node(self, '+')
        canvas.arc(self.a, self)
        canvas.arc(self.b, self)


class Sub(Node):
    "A subtraction dataflow graph node."
    def __init__(self, a, b):
        self._parents = as_graph(a), as_graph(b)

    def parents(self):
        return self._parents

    def __repr__(self):
        return '(%r) - (%r)' % self._parents

    def apply(self, args, env):
        a, b = args
        return a - b

    def render_graphviz(self, canvas):
        canvas.record(self, '{{a|b}|a - b}')
        a, b = self._parents
        canvas.arc(a, self, 'a')
        canvas.arc(b, self, 'b')


class Mul(Node):
    "A multiplication dataflow graph node."
    def __init__(self, a, b):
        self.a = as_graph(a)
        self.b = as_graph(b)

    def parents(self):
        return self.a, self.b

    def __repr__(self):
        return '(%r) * (%r)' % (self.a, self.b)

    def apply(self, args, env):
        a, b = args
        return a * b

    def render_graphviz(self, canvas):
        canvas.node(self, '×')
        canvas.arc(self.a, self)
        canvas.arc(self.b, self)


class Div(Node):
    "A division dataflow graph node. Interesting because it’s singular."
    def __init__(self, a, b):
        self._parents = as_graph(a), as_graph(b)

    def parents(self):
        return self._parents

    def __repr__(self):
        return '(%r) / (%r)' % self._parents

    def apply(self, args, env):
        a, b = args
        return a / b

    def render_graphviz(self, canvas):
        canvas.record(self, "{num|denom}|÷")
        num, denom = self._parents
        canvas.arc(num, self, 'num')
        canvas.arc(denom, self, 'denom')


class Pow(Node):
    "A base to a power."
    def __init__(self, base, power):
        self.base = as_graph(base)
        self.power = as_graph(power)

    def parents(self):
        return self.base, self.power

    def __repr__(self):
        return '(%r) ** (%r)' % (self.base, self.power)

    def apply(self, args, env):
        base, power = args
        return base ** power

    def render_graphviz(self, canvas):
        canvas.record(self, "{{base|power}|**}")
        canvas.arc(self.base, self, 'base')
        canvas.arc(self.power, self, 'power')


class Abs(Node):
    "An absolute value node. Interesting because it’s not differentiable."
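    # Not differentiable at zero, that is.  The convention actually
    # used elsewhere in this file (see Dual.__abs__) is d|x|/dx =
    # x/|x| away from zero, with the derivative at exactly zero taken
    # to be 0 rather than NaN.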
    def __init__(self, arg):
        self.arg = as_graph(arg)

    def parents(self):
        return self.arg,

    def __repr__(self):
        return 'abs(%r)' % (self.arg,)

    def apply(self, args, env):
        arg, = args
        return abs(arg)

    def render_graphviz(self, canvas):
        canvas.node(self, '||')
        canvas.arc(self.arg, self)


class Const(Node):
    "A constant in the dataflow graph."
    def __init__(self, value):
        self.value = value

    def parents(self):
        return ()

    def __repr__(self):
        return 'Const(%s)' % (self.value,)

    def __hash__(self):
        return hash(self.value)

    def __eq__(self, other):
        return other.__class__ is self.__class__ and self.value == other.value

    def repr_step(self, program):
        return repr(self)

    def apply(self, args, env):
        return self.value

    def render_graphviz(self, canvas):
        canvas.node(self, str(self.value), shape='box')


def as_graph(obj):
    "Coerce an object to a dataflow graph."
    return obj if isinstance(obj, Node) else Const(obj)


# "Programs".

class Program:
    "A program is an ordered sequence of computational steps."
    def __init__(self):
        self.nodes = []
        self.node_id = {}

    def __contains__(self, node):
        return node in self.node_id

    def index(self, node):
        return self.node_id[node]

    def add(self, node):
        self.node_id[node] = len(self.nodes)
        self.nodes.append(node)

    def inputs(self):
        return [node for node in self.nodes if node.is_input]

    def graphviz(self):
        canvas = Canvas()
        for node in self.nodes:
            node.render_graphviz(canvas)  # subclass must define
        return canvas.render()

    def eval(self, env):
        vals = []
        for node in self.nodes:
            val = node.apply(tuple(vals[self.node_id[parent]]
                                   for parent in node.parents()),
                             env)  # env only used by Var nodes!
            vals.append(val)
            if node is self.nodes[-1]:
                return val

    def __len__(self):
        return len(self.nodes)

    def __repr__(self):
        return 'Program([%s])' % ', '.join(
            '%d: %s' % (i, node.repr_step(self))
            for i, node in enumerate(self.nodes))


# Graphviz

class Canvas:
    def __init__(self):
        self.text = ['digraph dataflow {\n']
        self.obj = {}   # maps names to objects
        self.name = {}  # maps objects to names
        self.counter = 0

    def render(self):
        return ''.join(self.text) + '}\n'

    def say(self, *args):
        self.text.extend(list(args))

    def record(self, obj, label):
        self.node(obj, label, shape='record')

    def arc(self, start, end, anchor=None):
        self.say(' ', self.name[start], ' -> ', self.name[end])
        if anchor is not None:
            self.say(':', anchor)
        self.say(';\n')

    def node(self, obj, label, shape=None):
        name = self.register_name(obj)
        self.say(' ', name, ' [')
        opts = {'label': label}
        if shape is not None:
            opts['shape'] = shape
        for opt in opts:
            self.say(opt, '="', opts[opt], '"', ', ')
        self.text.pop()
        self.say('];\n')

    def register_name(self, obj):
        while True:
            label = str(self.counter)
            self.counter += 1
            if label not in self.obj:
                self.obj[label] = obj
                self.name[obj] = label
                return label


# Optimization algorithms.

def randopt(graph, tries=1000):
    "Simplest optimizer: try random inputs. Use fastopt or walk instead."
    prog = graph.tsort()
    inputs = prog.inputs()
    best = None
    for i in range(tries):
        vals = tuple(generate_value() for var in inputs)
        result = prog.eval(dict(zip(inputs, vals)))
        if best is None or result < best[0]:
            best = result, vals
    return best[0], dict(zip(inputs, best[1])), graph


def generate_value():
    "Dunno, man.  This is probably something reasonable."
    if random.randrange(32) == 0:
        return 0
    sign = random.choice([-1, 1])
    expsign = random.choice([-1, 1])
    exponent = min(1023, max(-1074, random.expovariate(.25)))
    return sign * 2.0**(expsign * exponent)
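
# For example (an illustrative sketch of the calling convention, not a
# test that is run anywhere):
#
#     x = some.x
#     best_value, best_env, _ = randopt((x - 3)**2)
#
# best_value is the smallest objective value found among the random
# guesses, and best_env maps each input Var to the guess that produced
# it, so it can be passed straight back to graph.eval().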


def randoptvec(graph, tries=100000):
    "Simplest optimizer, numpy-optimized: try random inputs. fastopt is safer."
    prog = graph.tsort()
    inputs = prog.inputs()
    vals = tuple(generate_vec(tries) for var in inputs)
    result = prog.eval(dict(zip(inputs, vals)))
    i = result.argmin()
    return result[i], {var: val[i] for var, val in zip(inputs, vals)}, graph


def generate_vec(tries):
    "A vectorized version of generate_value()."
    sign = numpy.random.randint(0, 2, tries) * 2 - 1
    expsign = numpy.random.randint(0, 2, tries) * 2 - 1
    exponent = numpy.random.exponential(4, tries).clip(-1074, 1023)
    return numpy.where(numpy.random.randint(0, 32, tries) == 0,
                       0,
                       sign * 2.0**(expsign * exponent))


def fastopt(graph, tries=100000, per_iter=10000):
    "Use randoptvec repeatedly in order to not explode memory."
    # Using too many tries in randoptvec is not advisable; you run out
    # of memory.  Some tests with a univariate 31-node dataflow graph
    # (from a version of the complex-number optimization problem at
    # the beginning of the docstring) hung the machine when randoptvec
    # created vectors of 10**8 items, but it was able to evaluate that
    # number of tries iteratively as follows:
    #
    #   iterations   vector size   time
    #            4    25 × 10⁶     machine hangs
    #            5     2 × 10⁷     machine hangs
    #            6   167 × 10⁵     2'9"
    #            7   143 × 10⁵     2'6"
    #            8   125 × 10⁵     2'6"
    #            9    11 × 10⁶     2'4"
    #           10         10⁷     2'6"
    #           11     9 × 10⁶     2'5"
    #           25     4 × 10⁶     2'2"
    #           50     2 × 10⁶     2'2"
    #          100         10⁶     2'3"
    #          333     3 × 10⁵     2'8"
    #          500     2 × 10⁵     2'9"
    #         1000         10⁵     2'3"
    #          10⁴         10⁴     2'4"
    #          10⁵         1000    3'8"
    #          10⁶          100    15'36" (extrapolated)
    #
    # So, on my laptop, there’s a large flat area of consistent,
    # stable performance, covering three or four orders of magnitude,
    # with fairly steep performance cliffs on each side — caused by
    # memory exhaustion on one side and by CPython interpretive
    # overhead on the other.  I was hoping that sufficiently small
    # arrays would give notably better performance due to fitting into
    # L1 cache, but that doesn’t happen before CPython kills our
    # performance.
    #
    # I think the right answer to this is to pick the smallest vector
    # size that gives us near-optimal performance, which is 10⁴, which
    # gives us the maximal headroom on graph complexity; then we hope
    # that nobody tries to use it with graphs of more than about 31000
    # nodes, or that they have more RAM if they do.
    #
    # However, in an absolute sense, this isn’t really great
    # performance.  2'4" to run 31 operations 100 million times is
    # only 25 million operations per second, and each of this laptop’s
    # four 1.6GHz cores should be able to manage about 50 times that.
    best = None
    while tries > 0:
        n = min(tries, per_iter)
        tries -= n
        result = randoptvec(graph, tries=n)
        if best is None or result[0] < best[0]:
            best = result
    return best


def walk(graph, iters=1000, width=1000):
    """Optimize using a random walk, sort of like hill climbing or annealing.

    Unlike the simpler minimization algorithms above, this one is
    actually practically useful.

    This algorithm attempts to iteratively improve its best guess so
    far.  Each new candidate is formed by adding a random delta to the
    current best guess; some of these deltas are large, but most are
    small.  Only new candidates that improve over the current best are
    accepted.  The idea is that if you’re in a particularly bad place,
    the large jumps sort of sample the whole space to get you close to
    the global optimum quickly — like random restarts for gradient
    descent — while if you’re already in a reasonable place, wasting a
    small amount of effort on considering large jumps is relatively
    cheap, while most of the effort is spent on tiny improvements.
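
    To put rough numbers on the step distribution used in the body
    below: the scale factor is 2**(E − 20), where E is exponentially
    distributed with mean 15, so its median is about 2**(15·ln 2 − 20)
    ≈ 1.3 × 10⁻³, while roughly a quarter of the draws (those with
    E > 20, probability e^(−4/3) ≈ 0.26) multiply the fresh random
    delta by more than 1; that is where the occasional huge jumps
    come from.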

    In the simple tests I’ve done so far, it seems to stop improving
    once it reaches the floor of the step-size distribution.
    """
    prog = graph.tsort()
    inputs = prog.inputs()
    vals = tuple(generate_vec(width) for var in inputs)
    best = None
    i = 0
    while True:
        result = prog.eval(dict(zip(inputs, vals)))
        j = result.argmin()
        latest_best = result[j], tuple(val[j] for val in vals)
        if best is None or latest_best < best:
            # if best is not None: print(i, latest_best[0], steps[j])
            best = latest_best
        i += 1
        if i == iters:
            break
        steps = 2**(numpy.random.exponential(15, width) - 20)
        vals = tuple(val + steps * generate_vec(width) for val in best[1])
    return best[0], dict(zip(inputs, best[1])), graph


def gdforward(graph, width=512, iters=512, initial_learning=0.1,
              final_learning=1e-9, start=None):
    """Gradient descent with forward-mode automatic differentiation.

    This algorithm appears to be usable in practice.

    `width` is the number of random starting points to run the
    gradient descent from; setting it below 256 provides little to no
    further speedup, but whether 256 is adequate for your problem
    depends on the problem.

    `start`, if not None, is a dictionary of variable assignments from
    which to start the gradient descent.  If `width` > 1, other
    starting points will also be tried.
    """
    prog = graph.tsort()
    inputs = prog.inputs()
    initial = numpy.log(initial_learning)
    final = numpy.log(final_learning)
    next_vals = tuple(generate_vec(width) for var in inputs)
    if start is not None:
        for i, var in enumerate(inputs):
            next_vals[i][0] = start[var]
    for i in range(iters+1):
        vals = next_vals        # a sort of lame way to avoid a loop-and-a-half
        result = gradient_forward(prog, inputs, vals)
        learning_rate = numpy.exp(lerp(initial, i/iters, final))
        next_vals = tuple(val - learning_rate * result.deriv(var)
                          for var, val in zip(inputs, vals))
    j = result.value.argmin()
    return (result.value[j],
            {var: val[j] for var, val in zip(inputs, vals)},
            graph)


def gdm_forward(graph, width=512, iters=512, initial_learning=0.1,
                final_learning=1e-9, beta=0.9, start=None, v=None):
    "Gradient descent with momentum and forward-mode automatic differentiation."
    prog = graph.tsort()
    inputs = prog.inputs()
    initial = numpy.log(initial_learning)
    final = numpy.log(final_learning)
    if v is None:
        v = tuple(0 for var in inputs)
    next_vals = tuple(generate_vec(width) for var in inputs)
    if start is not None:
        for i, var in enumerate(inputs):
            next_vals[i][0] = start[var]
    for i in range(iters+1):
        vals = next_vals
        result = gradient_forward(prog, inputs, vals)
        learning_rate = numpy.exp(lerp(initial, i/iters, final))
        v = tuple(vi * beta - (1 - beta) * learning_rate * result.deriv(var)
                  for vi, var in zip(v, inputs))
        next_vals = tuple(val + vi for val, vi in zip(vals, v))
    j = result.value.argmin()
    return (result.value[j],
            {var: val[j] for var, val in zip(inputs, vals)},
            graph, v)


def gradient_forward(prog, inputs, vals):
    return prog.eval({var: Dual(val, {var: 1})
                      for var, val in zip(inputs, vals)})


def lerp(start, dist, end):
    return start + dist * (end - start)


# Forward-mode automatic differentiation.
# Not yet complete.
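
# A worked example of what Dual is for (just an illustration; nothing
# below depends on it):
#
#     x = some.x
#     d = Dual(3.0, {x: 1.0})   # the value of x at the point x = 3
#     e = d*d + 2*d             # i.e. x**2 + 2*x carried forward
#     e.value                   # 15.0
#     e.deriv(x)                # 8.0, which is d/dx(x**2 + 2*x) at x = 3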

class Dual(collections.namedtuple('Dual', ['value', 'derivs'])):
    "A dual value: the value at a point and its first partial derivatives."

    def deriv(self, var):
        return self.derivs.get(var, 0)

    def depend_on(self, other):
        "The set of variables either self or other depends on.  Badly named."
        return set(self.derivs) | set(other.derivs)

    def __add__(self, other):
        other = as_dual(other)
        return Dual(self.value + other.value,
                    {var: self.deriv(var) + other.deriv(var)
                     for var in self.depend_on(other)})

    def __radd__(self, other):
        return self + other

    def __sub__(self, other):
        return self + -other

    def __rsub__(self, other):
        return other + -self

    def __neg__(self):
        return -1 * self

    def __mul__(self, other):
        other = as_dual(other)
        return Dual(self.value * other.value,
                    {var: (self.value * other.deriv(var) +
                           other.value * self.deriv(var))
                     for var in self.depend_on(other)})

    def __rmul__(self, other):
        return self * other

    # XXX missing: division, exponentiation.  I can implement division
    # in terms of exponentiation, but I'm not totally clear how to
    # manage exponentiation itself.  I suppose a^b = e^{b ln a}, but
    # that definition only works (in the reals) for positive a.  But I
    # guess that sort of makes sense, since taking negative numbers to
    # fractional powers gives you complex numbers anyway?  But I still
    # want to be able to square negative numbers, or take their
    # reciprocal, without complex numbers popping up!

    def __abs__(self):
        # Using numpy here so that we don’t get NaN for
        # abs(Dual(0, {})), and moreover for arrays including 0.
        zeros = (self.value == 0)
        denom = abs(numpy.where(zeros, 1, self.value))
        return self * numpy.where(zeros, 0, (self.value / denom))


def as_dual(obj):
    return obj if isinstance(obj, Dual) else Dual(obj, {})


# Dumb test scripts.

def displacement_2d(pa, pb):
    xa, ya = pa
    xb, yb = pb
    return xa - xb, ya - yb


def distsq(pa, pb):
    dx, dy = displacement_2d(pa, pb)
    return dx*dx + dy*dy


def dumb_geometric_example():
    # Now let’s define a geometric problem.  First, four points.
    p1 = x1, y1 = some.x1, some.y1
    p2 = x2, y2 = some.x2, some.y2
    p3 = x3, y3 = some.x3, some.y3
    p4 = x4, y4 = some.x4, some.y4

    # Let’s say these four points define a rectangle, its diagonal is
    # 5 units, and one of its sides is 3 units.  Note that this is
    # underdetermined, since the position and rotation of the
    # rectangle can be anything.  First, the diagonal:
    diag_err = abs(distsq(p1, p3) - 5**2)

    # Then, the side.
    side_err = abs(distsq(p1, p2) - 3**2)

    # To constrain perpendicularity, we can set the dot product of the
    # vectors to zero.
    def dot(disp1, disp2):
        dx1, dy1 = disp1
        dx2, dy2 = disp2
        return dx1*dx2 + dy1*dy2

    rectal_1 = abs(dot(displacement_2d(p1, p2), displacement_2d(p2, p3)))
    rectal_2 = abs(dot(displacement_2d(p2, p3), displacement_2d(p3, p4)))
    rectal_3 = abs(dot(displacement_2d(p3, p4), displacement_2d(p4, p1)))
    rectal_4 = abs(dot(displacement_2d(p4, p1), displacement_2d(p1, p2)))

    total = diag_err + side_err + rectal_1 + rectal_2 + rectal_3 + rectal_4
    return p1, p2, p3, p4, total


def dumb_test_1():
    x = some.x
    f = x**2 + 4*x + 3
    print(f.tsort())
    print(f.eval({x: -2}))
    print(f.graphviz())
    m = randopt(f)
    print(m)
    print(randopt(f + some.y))
    print(randopt(abs(f)))
    print(randopt(x**2 + (some.x - 1)**2))
    print(randopt(-(2**-(some.w**2))))
    print(randopt((3+x)**4))
    print(randopt(x**2 * 3 - 4*x))
    print(randopt((x-1)**2/2 + 1 / (2 + (x+1)**2)))
    print(randoptvec(f))
    print(randoptvec((x-1)**2/2 + 1 / (2 + (x+1)**2)))
    print(randoptvec((x-2)**2 + ((x-4) * (some.y + 3))**2))

    # The example at the beginning of the documentation, except with
    # real solutions instead of complex ones, since randoptvec doesn’t
    # do complex numbers by default.
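    # (Flipping the constant term from +6 to -6 is what makes real
    # roots available: the polynomial below is -6 at z = 0 and +18 at
    # z = 1, so it crosses zero somewhere in between.)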
    z = some.z
    zf = abs(3*z**6 - z**5 + 4*z**4 + 3*z**3 + 6*z**2 + 9*z - 6)
    result = randoptvec(zf, tries=10**6)
    print(result)
    print(len(zf.tsort()))
    last = 0
    for i in range(3):          # used 10000 for earlier testing
        if i > last * 1.2:
            print("iter %d:" % i)
            last = i
        nr = randoptvec(zf, tries=10000)
        if nr[0] < result[0]:
            print("improved to %s at %s" % (nr[0], nr[1]))
            result = nr

    # Let’s try the real thing with complex numbers.
    z2 = some.re + 1j * some.im
    zf2 = abs(3*z2**6 - z2**5 + 4*z2**4 + 3*z2**3 + 6*z2**2 + 9*z2 + 6)
    print(zf2)
    for i in range(3):
        print('found %s at %s' % fastopt(zf2, tries=10**6)[:2])

    # Now let’s try a slightly less stupid optimization algorithm, a
    # sort of hill-climbing thing.  It finds solutions precise to 10
    # decimal places in a million evaluations, while “fastopt” only
    # got 2 decimal places in that time.  This algorithm is still
    # derivative-free.
    print(walk(zf2))

    (x1, y1), _, _, (x4, y4), total = dumb_geometric_example()
    result = err, inputs, graph = walk(total)
    print(len(graph.tsort()))
    print(result)
    # This printed 3.99999974566, 3.9999997827, and 4.00000006745 the
    # last three runs, which is a pretty good sign: with a diagonal of
    # 5 and one side of 3, the other side of the rectangle should be 4
    # (a 3-4-5 right triangle).
    print(distsq((inputs[x1], inputs[y1]), (inputs[x4], inputs[y4]))**0.5)

    # width   iters to apparently always converge
    #  100     110
    #  1k       25
    #  10k      15
    #  100k      8
    #  1M        4
    #
    # Here “apparently always converges” means something like “has
    # less than 1% error in the final distance 75% of the time,” which
    # is not a particularly demanding level of performance but is easy
    # to measure.
    #
    # Using widths as low as 100 is probably not useful:
    #
    # width   ms per iter (from a previous version of the code)
    #    10     2.7
    #    60     2.9
    #   100     3.0
    #   160     3.4
    #  1000     6.0
    # 10000    32.0
    #
    # (Last I measured, 100 now takes 2.0 ms.)
    #
    # That is, you can do two iterations that are each 100 points wide
    # in the time for an iteration of 1000, but you need something
    # like four times as many iterations, at least for this
    # undemanding level of performance.  110 iterations of 100 points
    # are half as much computation as 25 iterations of 1000 points,
    # but they take twice as long in Python.
    #
    # Note that the absolute level of performance here is abysmal: to
    # resolve the constraints on a single rectangle with four points,
    # we need 150 milliseconds, or maybe 100 milliseconds with the new
    # code.  But it seems that this is an algorithm that reaches the
    # level of being useful in practice, even if marginally.
    print(graph.graphviz())
    print(graph.tsort())


def dumb_test_2():
    x = some.x
    y = some.y

    # Let’s evaluate a simple multivariate binomial at a point and
    # compute its gradient there.
    f1 = Dual(2.3, {x: 1})
    f2 = Dual(1.1, {y: 1})
    f3 = f1 * 2 + 3 * f2
    print(f3)

    # Now let’s see about interactions between a variable and itself,
    # at x=1.
    x1 = Dual(1, {x: 1})
    print(x1*x1)
    print(1+x1*x1)
    print(x1*x1*x1 - x1*x1)

    # At x=2.
    x2 = Dual(2, {x: 1})
    print(x2*x2*x2 - x2*x2)
    print(1 - x2)

    # At various points.
    xn = Dual(numpy.linspace(-1, 1, 41), {x: 1})
    print(xn*xn)
    print(abs(xn)*xn)

    # Verify we can supply dual numbers as parameters to Program.eval.
    print((abs(x)*x).eval({x: xn}))

    # And that the hack in __abs__() for numpy didn’t hork evaluation
    # with scalars.
    print((abs(x)*x).eval({x: Dual(0, {x: 1})}))
    print(abs(x).eval({x: Dual(0, {x: 1})}))

    # How about a really simple case of gradient descent?
    print(gdforward(abs(x)))
    # That seems to work okay.  How about slightly harder?
    print(gdforward(abs(x+1)))
    # So far, so good.  How about a harder linear equation?
    print(gdforward(abs(2*x-7)))
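    # (For checking the printed answers by eye: the minimizers of the
    # three cases above are x = 0, x = -1, and x = 3.5, each with an
    # objective value of 0.)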
    # How about the simplest quadratic equation?
    print(gdforward(x*x))
    # Slightly harder?
    print(gdforward((x+3)*(x+3)))
    # In polynomial form?
    print(gdforward(x*x + 6*x + 9))
    # Finding the minimum, without the constant?
    print(gdforward(x*x + 6*x))
    # Now, what if we try to find the roots?  This reliably finds the
    # one at 0 because 0 is one of our starting guesses.
    print(gdforward(abs(x*x + 6*x)))
    # If we make it slightly harder, what happens?  This works too,
    # finding -0.172 or occasionally -5.828.
    print(gdforward(abs(x*x + 6*x + 1)))
    # This does too, finding 1.5414 and -4.5414.
    loss = abs(x*x + 3*x - 7)
    result = gdforward(loss)
    print(result)
    print('=', result[2].eval(result[1]))  # I had a bug!

    # These reduced settings still more or less work:
    print()
    print(gdforward(loss, iters=16))
    print(gdforward(loss, width=1))  # gradient descent without restarts

    # width=128 iters=128    4.223  4.245  4.234
    # width=128 iters=1128   4.606  4.612  4.633
    # 1000 iters took (- 4.612 4.234) = 0.378 seconds, so each
    # iteration at width 128 takes about 400 microseconds.
    # width=256 iters=1128   4.589  4.593  4.666
    # width=256 iters=128    4.207  4.206  4.221
    # 1000 iters took (- 4.207 4.593) = 0.386 seconds, so each
    # iteration at width 256 takes about 400 microseconds still.
    # width=512 iters=1128   4.719  4.689  4.671
    # width=512 iters=128    4.240  4.264  4.233
    # 1000 iters took (- 4.689 4.240) = 0.449 seconds, so at width
    # 512, we start to see a slowdown.
    print(gdforward(loss, width=128, iters=16))
    print()

    # Let’s try the geometric example.  This takes 6 ms per iteration,
    # while the analogous random walk only took 3 ms per iteration,
    # even though in theory we are in an 8-dimensional realm that
    # should give a substantial advantage to gradient descent.
    (x1, y1), _, _, (x4, y4), total = dumb_geometric_example()
    result = err, inputs, graph = gdforward(total, width=128, iters=128)
    print(result)
    print(distsq((inputs[x1], inputs[y1]), (inputs[x4], inputs[y4]))**0.5)


if __name__ == '__main__':
    dumb_test_2()
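

# The module docstring promises reverse-mode automatic differentiation,
# which isn't implemented above.  The sketch below (appended after the
# demo) is my illustration of the backward pass described there, not
# the author's design: one forward pass over a Program records every
# intermediate value, then one backward pass pushes each step's adjoint
# (d final output / d step) to its parents in reverse order.  It only
# knows the local derivatives of Add, Sub, Mul, and Div; Pow and Abs
# would need the same treatment that Dual still needs for them.  The
# name `gradient_reverse` is made up here.

def gradient_reverse(graph, env):
    "Return (value, {var: d value / d var}) by reverse accumulation."
    prog = graph.tsort()
    # Forward pass: evaluate every step, keeping all the intermediates.
    vals = []
    for node in prog.nodes:
        args = tuple(vals[prog.index(parent)] for parent in node.parents())
        vals.append(node.apply(args, env))
    # Backward pass: the final step's derivative with respect to itself
    # is 1; every other adjoint starts at 0 and accumulates contributions
    # from the steps that consume it.
    adjoints = [0.0] * len(prog.nodes)
    adjoints[-1] = 1.0
    for i in reversed(range(len(prog.nodes))):
        node, adj = prog.nodes[i], adjoints[i]
        if isinstance(node, (Add, Sub, Mul, Div)):
            pa, pb = node.parents()
            ia, ib = prog.index(pa), prog.index(pb)
            va, vb = vals[ia], vals[ib]
            if isinstance(node, Add):
                da, db = 1.0, 1.0
            elif isinstance(node, Sub):
                da, db = 1.0, -1.0
            elif isinstance(node, Mul):
                da, db = vb, va
            else:               # Div: d(a/b)/da = 1/b, d(a/b)/db = -a/b**2
                da, db = 1.0 / vb, -va / (vb * vb)
            adjoints[ia] += adj * da
            adjoints[ib] += adj * db
        # Var and Const have no parents; other node types are ignored,
        # so don't trust this sketch on graphs containing Pow or Abs.
    return vals[-1], {var: adjoints[prog.index(var)] for var in prog.inputs()}

# For instance, with u, v = some.u, some.v,
# gradient_reverse(u*v + u, {u: 3.0, v: 5.0}) returns
# (18.0, {Var('u'): 6.0, Var('v'): 3.0}), since d(uv + u)/du = v + 1
# and d(uv + u)/dv = u.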