#!/usr/bin/python3
# -*- coding: utf-8 -*-
"""Simple reverse-mode automatic differentiation in Python.

There are a number of problems for which it’s relatively easy to
write a program to determine whether a given configuration is a
solution, but relatively hard to write a program to find one.  For
example:

- find a complex number z for which -6 = 3z⁶ - z⁵ + 4z⁴ + 3z³ + 6z² +
  9z (there’s one close to -0.134714309 + 1.1861122304i, for example,
  and another near -0.697095529 + 0.38106567i, and each of those has
  a conjugate);

- find a non-overlapping spatial arrangement of letters from only a
  few fonts that closely approximates a given page image (predicting,
  say, 99% of its pixels);

- find the solution x to a set of linear equations Ax = b, where A is
  a matrix and x and b are vectors;

- find the distribution of stresses in a given piece of elastic
  material given its structure and material properties, and some
  static loads and boundary conditions;

- find a way to lay out a given JSON data structure within 15 lines
  and 80 columns.

The problem of finding such a solution, given a program to recognize
it, is known as an “inverse problem”, which is an unfortunately vague
name.  There’s a closely related class of problems known,
confusingly, as “programming”, “mathematical programming”,
“optimization”, or “mathematical optimization”, in which the
objective is to find the minimum of some function.  I'm going to call
them “minimization problems”.  The above inverse-problem examples can
in some sense be interchanged with the following minimization
problems:

- find a complex number z which minimizes |3z⁶ - z⁵ + 4z⁴ + 3z³ + 6z²
  + 9z + 6|;

- find the non-overlapping spatial arrangement of letters from only a
  few fonts that minimizes the approximation error to a given page
  image;

- find a vector x̂ that minimizes |Ax̂ - b|;

- find a possible distribution of stresses in a given piece of
  elastic material which, given its structure, material properties,
  loads, and boundary conditions, minimizes the accelerations it
  would be experiencing;

- minimize the number of lines needed to lay out a given JSON data
  structure within 80 columns.

A note about terminology: the thing you’re trying to find the minimum
of is called the “objective function” or “cost function”; the things
you can change to try to lower it are called the “design variables”.

These problems are equivalent in the following sense.  If you can
solve the minimization problem correctly, the solution will be a
solution to the corresponding inverse problem, if one exists; and,
conversely, if you can solve the inverse problem, and it has some
kind of error-tolerance parameter, you can solve it repeatedly with
gradually tightening error bounds to approximate the solution to the
minimization problem to a desired precision.

Reframing an inverse problem as a minimization problem often makes it
easier to solve, because there are a lot of algorithms available for
solving minimization problems.  As I understand it, the algorithms
that work efficiently for high-dimensionality problems (in practice)
require you to be able not only to evaluate the objective function at
different points, but also to evaluate its partial derivatives with
respect to the design variables — that is to say, its gradient.  It
turns out that essentially any function you can compute also has a
computable derivative, at the cost of only a constant factor more
computation, to the same precision, using a procedure known as
“automatic differentiation”.
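
(To make the use of the gradient concrete: the gradient-descent
optimizers later in this file repeatedly apply the update

    x ← x − η · ∂f/∂x

to each design variable x, where f is the objective and η is a
learning rate.  Everything here about derivatives exists to make
those ∂f/∂x values cheap to obtain for an arbitrary dataflow graph.)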

Automatic differentiation can be done in “forward mode”, which is
very simple to implement: you simply accompany each intermediate
quantity in the computation with its gradient with respect to the
independent variables.  But this is slow in the case where there are
many independent variables; if you have N independent variables and
compute M intermediate quantities, you have to do N·M computations, N
times more than merely computing the function pointwise.  In the case
where you only care about the gradient of one or a few dependent
variables — such as the objective function in a minimization problem
— a much faster alternative is available, called “reverse-mode
automatic differentiation”.  You can propagate the gradient
information *backwards* through the dataflow graph, computing the
loading (the sensitivity) of the dependent variables on each
intermediate quantity, in the reverse of the order in which you
computed the quantities in the first place.  This can be many orders
of magnitude faster than forward mode for minimization problems.

In short, reverse-mode automatic differentiation is a crucial
enabling technology for computing gradients cheaply; cheap gradients
are a crucial enabling technology for solving minimization problems;
and solving minimization problems is a crucial enabling technology
for solving inverse problems.

I’m writing this program in order to understand how to implement
reverse-mode automatic differentiation.

What’s implemented so far:

* Operator overloading to construct a dataflow graph
* Optimization of it using random search
* First-order forward-mode automatic differentiation of it
* Optimization of it using gradient descent

Not yet:

- Gradient descent optimization at a reasonable speed
- Gradient descent optimization that works reliably?
- Reverse-mode automatic differentiation

"""
from __future__ import division, print_function

import collections
import random

import numpy


# Dataflow graph construction.

class Vars:
    "Syntactic sugar to define variables to start the dataflow graph with."
    def __init__(self):
        self.assigned = set()

    def __getattr__(self, name):
        if name not in self.assigned:
            self.assigned.add(name)
            return Var(name)
        i = 0
        while True:
            n = '%s%d' % (name, i)
            i += 1
            if n not in self.assigned:
                self.assigned.add(n)
                return Var(n)


some = Vars()


class Node:
    "A node in the dataflow graph; abstract base class."
    is_input = False

    def inputs(self):
        "Return the Vars that affect this node, transitively."
        return self.tsort().inputs()

    def tsort(self, program=None):
        "Topologically sort the dataflow graph nodes into a program."
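        # This is a depth-first postorder traversal: a node's parents
        # are appended to the Program before the node itself, so each
        # step appears after every step it depends on, and the
        # membership test below keeps a shared subexpression from
        # being emitted more than once.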
        if program is None:
            program = Program()
        for parent in self.parents():
            if parent not in program:
                parent.tsort(program)
        program.add(self)
        return program

    def eval(self, env):
        return self.tsort().eval(env)

    def graphviz(self):
        return self.tsort().graphviz()

    def repr_step(self, program):
        return '%s(%s)' % (self.__class__.__name__,
                           ', '.join(str(program.index(arg))
                                     for arg in self.parents()))

    def __add__(self, other):
        return Add(self, other)

    def __radd__(self, other):
        return Add(other, self)

    def __sub__(self, other):
        return Sub(self, other)

    def __rsub__(self, other):
        return Sub(other, self)

    def __neg__(self):
        return -1 * self

    def __mul__(self, other):
        return Mul(self, other)

    def __rmul__(self, other):
        return Mul(other, self)

    def __truediv__(self, other):
        return Div(self, other)

    def __rtruediv__(self, other):
        return Div(other, self)

    def __pow__(self, exp):
        return Pow(self, exp)

    def __rpow__(self, base):
        return Pow(base, self)

    def __abs__(self):
        return Abs(self)


class Var(Node):
    "An input to the dataflow graph; an independent variable."
    is_input = True

    def __init__(self, name):
        self.name = name

    def parents(self):
        return ()

    def __repr__(self):
        return 'Var(%r)' % (self.name,)

    def repr_step(self, program):
        return repr(self)

    def apply(self, args, env):
        return env[self]

    def render_graphviz(self, canvas):
        canvas.node(self, self.name, shape='diamond')


class Add(Node):
    "An addition dataflow graph node."
    def __init__(self, a, b):
        self.a = as_graph(a)
        self.b = as_graph(b)

    def parents(self):
        return self.a, self.b

    def __repr__(self):
        return '(%r) + (%r)' % (self.a, self.b)

    def apply(self, args, env):
        a, b = args
        return a + b

    def render_graphviz(self, canvas):
        canvas.node(self, '+')
        canvas.arc(self.a, self)
        canvas.arc(self.b, self)


class Sub(Node):
    "A subtraction dataflow graph node."
    def __init__(self, a, b):
        self._parents = as_graph(a), as_graph(b)

    def parents(self):
        return self._parents

    def __repr__(self):
        return '(%r) - (%r)' % self._parents

    def apply(self, args, env):
        a, b = args
        return a - b

    def render_graphviz(self, canvas):
        canvas.record(self, '{{a|b}|a - b}')
        a, b = self._parents
        canvas.arc(a, self, 'a')
        canvas.arc(b, self, 'b')


class Mul(Node):
    "A multiplication dataflow graph node."
    def __init__(self, a, b):
        self.a = as_graph(a)
        self.b = as_graph(b)

    def parents(self):
        return self.a, self.b

    def __repr__(self):
        return '(%r) * (%r)' % (self.a, self.b)

    def apply(self, args, env):
        a, b = args
        return a * b

    def render_graphviz(self, canvas):
        canvas.node(self, '×')
        canvas.arc(self.a, self)
        canvas.arc(self.b, self)


class Div(Node):
    "A division dataflow graph node. Interesting because it’s singular."
    def __init__(self, a, b):
        self._parents = as_graph(a), as_graph(b)

    def parents(self):
        return self._parents

    def __repr__(self):
        return '(%r) / (%r)' % self._parents

    def apply(self, args, env):
        a, b = args
        return a / b

    def render_graphviz(self, canvas):
        canvas.record(self, "{num|denom}|÷")
        num, denom = self._parents
        canvas.arc(num, self, 'num')
        canvas.arc(denom, self, 'denom')


class Pow(Node):
    "A base to a power."
    def __init__(self, base, power):
        self.base = as_graph(base)
        self.power = as_graph(power)

    def parents(self):
        return self.base, self.power

    def __repr__(self):
        return '(%r) ** (%r)' % (self.base, self.power)

    def apply(self, args, env):
        base, power = args
        return base ** power

    def render_graphviz(self, canvas):
        canvas.record(self, "{{base|power}|**}")
        canvas.arc(self.base, self, 'base')
        canvas.arc(self.power, self, 'power')


class Abs(Node):
    "An absolute value node. Interesting because it’s not differentiable."
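    # Not differentiable at zero, that is.  The convention actually
    # used elsewhere in this file (see Dual.__abs__) is d|x|/dx =
    # x/|x| away from zero, with the derivative at exactly zero taken
    # to be 0 rather than NaN.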
    def __init__(self, arg):
        self.arg = as_graph(arg)

    def parents(self):
        return self.arg,

    def __repr__(self):
        return 'abs(%r)' % (self.arg,)

    def apply(self, args, env):
        arg, = args
        return abs(arg)

    def render_graphviz(self, canvas):
        canvas.node(self, '||')
        canvas.arc(self.arg, self)


class Const(Node):
    "A constant in the dataflow graph."
    def __init__(self, value):
        self.value = value

    def parents(self):
        return ()

    def __repr__(self):
        return 'Const(%s)' % (self.value,)

    def __hash__(self):
        return hash(self.value)

    def __eq__(self, other):
        return other.__class__ is self.__class__ and self.value == other.value

    def repr_step(self, program):
        return repr(self)

    def apply(self, args, env):
        return self.value

    def render_graphviz(self, canvas):
        canvas.node(self, str(self.value), shape='box')


def as_graph(obj):
    "Coerce an object to a dataflow graph."
    return obj if isinstance(obj, Node) else Const(obj)


# "Programs".

class Program:
    "A program is an ordered sequence of computational steps."
    def __init__(self):
        self.nodes = []
        self.node_id = {}

    def __contains__(self, node):
        return node in self.node_id

    def index(self, node):
        return self.node_id[node]

    def add(self, node):
        self.node_id[node] = len(self.nodes)
        self.nodes.append(node)

    def inputs(self):
        return [node for node in self.nodes if node.is_input]

    def graphviz(self):
        canvas = Canvas()
        for node in self.nodes:
            node.render_graphviz(canvas)  # subclass must define
        return canvas.render()

    def eval(self, env):
        vals = []
        for node in self.nodes:
            val = node.apply(tuple(vals[self.node_id[parent]]
                                   for parent in node.parents()),
                             env)  # env only used by Var nodes!
            vals.append(val)
            if node is self.nodes[-1]:
                return val

    def __len__(self):
        return len(self.nodes)

    def __repr__(self):
        return 'Program([%s])' % ', '.join(
            '%d: %s' % (i, node.repr_step(self))
            for i, node in enumerate(self.nodes))


# Graphviz

class Canvas:
    def __init__(self):
        self.text = ['digraph dataflow {\n']
        self.obj = {}   # maps names to objects
        self.name = {}  # maps objects to names
        self.counter = 0

    def render(self):
        return ''.join(self.text) + '}\n'

    def say(self, *args):
        self.text.extend(list(args))

    def record(self, obj, label):
        self.node(obj, label, shape='record')

    def arc(self, start, end, anchor=None):
        self.say(' ', self.name[start], ' -> ', self.name[end])
        if anchor is not None:
            self.say(':', anchor)
        self.say(';\n')

    def node(self, obj, label, shape=None):
        name = self.register_name(obj)
        self.say(' ', name, ' [')
        opts = {'label': label}
        if shape is not None:
            opts['shape'] = shape
        for opt in opts:
            self.say(opt, '="', opts[opt], '"', ', ')
        self.text.pop()
        self.say('];\n')

    def register_name(self, obj):
        while True:
            label = str(self.counter)
            self.counter += 1
            if label not in self.obj:
                self.obj[label] = obj
                self.name[obj] = label
                return label


# Optimization algorithms.

def randopt(graph, tries=1000):
    "Simplest optimizer: try random inputs. Use fastopt or walk instead."
    prog = graph.tsort()
    inputs = prog.inputs()
    best = None
    for i in range(tries):
        vals = tuple(generate_value() for var in inputs)
        result = prog.eval(dict(zip(inputs, vals)))
        if best is None or result < best[0]:
            best = result, vals
    return best[0], dict(zip(inputs, best[1])), graph


def generate_value():
    "Dunno, man.  This is probably something reasonable."
    if random.randrange(32) == 0:
        return 0
    sign = random.choice([-1, 1])
    expsign = random.choice([-1, 1])
    exponent = min(1023, max(-1074, random.expovariate(.25)))
    return sign * 2.0**(expsign * exponent)
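
# For example (an illustrative sketch of the calling convention, not a
# test that is run anywhere):
#
#     x = some.x
#     best_value, best_env, _ = randopt((x - 3)**2)
#
# best_value is the smallest objective value found among the random
# guesses, and best_env maps each input Var to the guess that produced
# it, so it can be passed straight back to graph.eval().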


def randoptvec(graph, tries=100000):
    "Simplest optimizer, numpy-optimized: try random inputs. fastopt is safer."
    prog = graph.tsort()
    inputs = prog.inputs()
    vals = tuple(generate_vec(tries) for var in inputs)
    result = prog.eval(dict(zip(inputs, vals)))
    i = result.argmin()
    return result[i], {var: val[i] for var, val in zip(inputs, vals)}, graph


def generate_vec(tries):
    "A vectorized version of generate_value()."
    sign = numpy.random.randint(0, 2, tries) * 2 - 1
    expsign = numpy.random.randint(0, 2, tries) * 2 - 1
    exponent = numpy.random.exponential(4, tries).clip(-1074, 1023)
    return numpy.where(numpy.random.randint(0, 32, tries) == 0,
                       0,
                       sign * 2.0**(expsign * exponent))


def fastopt(graph, tries=100000, per_iter=10000):
    "Use randoptvec repeatedly in order to not explode memory."
    # Using too many tries in randoptvec is not advisable; you run out
    # of memory.  Some tests with a univariate 31-node dataflow graph
    # (from a version of the complex-number optimization problem at
    # the beginning of the docstring) hung the machine when randoptvec
    # created vectors of 10**8 items, but it was able to evaluate that
    # number of tries iteratively as follows:
    #
    #   iterations   vector size   time
    #            4    25 × 10⁶     machine hangs
    #            5     2 × 10⁷     machine hangs
    #            6   167 × 10⁵     2'9"
    #            7   143 × 10⁵     2'6"
    #            8   125 × 10⁵     2'6"
    #            9    11 × 10⁶     2'4"
    #           10         10⁷     2'6"
    #           11     9 × 10⁶     2'5"
    #           25     4 × 10⁶     2'2"
    #           50     2 × 10⁶     2'2"
    #          100         10⁶     2'3"
    #          333     3 × 10⁵     2'8"
    #          500     2 × 10⁵     2'9"
    #         1000         10⁵     2'3"
    #          10⁴         10⁴     2'4"
    #          10⁵         1000    3'8"
    #          10⁶          100    15'36" (extrapolated)
    #
    # So, on my laptop, there’s a large flat area of consistent,
    # stable performance, covering three or four orders of magnitude,
    # with fairly steep performance cliffs on each side — caused by
    # memory exhaustion on one side and by CPython interpretive
    # overhead on the other.  I was hoping that sufficiently small
    # arrays would give notably better performance due to fitting into
    # L1 cache, but that doesn’t happen before CPython kills our
    # performance.
    #
    # I think the right answer to this is to pick the smallest vector
    # size that gives us near-optimal performance, which is 10⁴, which
    # gives us the maximal headroom on graph complexity; then we hope
    # that nobody tries to use it with graphs of more than about 31000
    # nodes, or that they have more RAM if they do.
    #
    # However, in an absolute sense, this isn’t really great
    # performance.  2'4" to run 31 operations 100 million times is
    # only 25 million operations per second, and each of this laptop’s
    # four 1.6GHz cores should be able to manage about 50 times that.
    best = None
    while tries > 0:
        n = min(tries, per_iter)
        tries -= n
        result = randoptvec(graph, tries=n)
        if best is None or result[0] < best[0]:
            best = result
    return best


def walk(graph, iters=1000, width=1000):
    """Optimize using a random walk, sort of like hill climbing or annealing.

    Unlike the simpler minimization algorithms above, this one is
    actually practically useful.

    This algorithm attempts to iteratively improve its best guess so
    far.  Each new candidate is formed by adding a random delta to the
    current best guess; some of these deltas are large, but most are
    small.  Only new candidates that improve over the current best are
    accepted.  The idea is that if you’re in a particularly bad place,
    the large jumps sort of sample the whole space to get you close to
    the global optimum quickly — like random restarts for gradient
    descent — while if you’re already in a reasonable place, wasting a
    small amount of effort on considering large jumps is relatively
    cheap, while most of the effort is spent on tiny improvements.
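
    To put rough numbers on the step distribution used in the body
    below: the scale factor is 2**(E − 20), where E is exponentially
    distributed with mean 15, so its median is about 2**(15·ln 2 − 20)
    ≈ 1.3 × 10⁻³, while roughly a quarter of the draws (those with
    E > 20, probability e^(−4/3) ≈ 0.26) multiply the fresh random
    delta by more than 1; that is where the occasional huge jumps
    come from.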

    In the simple tests I’ve done so far, it seems to stop improving
    once it reaches the floor of the step-size distribution.
    """
    prog = graph.tsort()
    inputs = prog.inputs()
    vals = tuple(generate_vec(width) for var in inputs)
    best = None
    i = 0
    while True:
        result = prog.eval(dict(zip(inputs, vals)))
        j = result.argmin()
        latest_best = result[j], tuple(val[j] for val in vals)
        if best is None or latest_best < best:
            # if best is not None: print(i, latest_best[0], steps[j])
            best = latest_best
        i += 1
        if i == iters:
            break
        steps = 2**(numpy.random.exponential(15, width) - 20)
        vals = tuple(val + steps * generate_vec(width) for val in best[1])
    return best[0], dict(zip(inputs, best[1])), graph


def gdforward(graph, width=512, iters=512, initial_learning=0.1,
              final_learning=1e-9, start=None):
    """Gradient descent with forward-mode automatic differentiation.

    This algorithm appears to be usable in practice.

    `width` is the number of random starting points to run the
    gradient descent from; setting it below 256 provides little to no
    further speedup, but whether 256 is adequate for your problem
    depends on the problem.

    `start`, if not None, is a dictionary of variable assignments from
    which to start the gradient descent.  If `width` > 1, other
    starting points will also be tried.
    """
    prog = graph.tsort()
    inputs = prog.inputs()
    initial = numpy.log(initial_learning)
    final = numpy.log(final_learning)
    next_vals = tuple(generate_vec(width) for var in inputs)
    if start is not None:
        for i, var in enumerate(inputs):
            next_vals[i][0] = start[var]
    for i in range(iters+1):
        vals = next_vals        # a sort of lame way to avoid a loop-and-a-half
        result = gradient_forward(prog, inputs, vals)
        learning_rate = numpy.exp(lerp(initial, i/iters, final))
        next_vals = tuple(val - learning_rate * result.deriv(var)
                          for var, val in zip(inputs, vals))
    j = result.value.argmin()
    return (result.value[j],
            {var: val[j] for var, val in zip(inputs, vals)},
            graph)


def gdm_forward(graph, width=512, iters=512, initial_learning=0.1,
                final_learning=1e-9, beta=0.9, start=None, v=None):
    "Gradient descent with momentum and forward-mode automatic differentiation."
    prog = graph.tsort()
    inputs = prog.inputs()
    initial = numpy.log(initial_learning)
    final = numpy.log(final_learning)
    if v is None:
        v = tuple(0 for var in inputs)
    next_vals = tuple(generate_vec(width) for var in inputs)
    if start is not None:
        for i, var in enumerate(inputs):
            next_vals[i][0] = start[var]
    for i in range(iters+1):
        vals = next_vals
        result = gradient_forward(prog, inputs, vals)
        learning_rate = numpy.exp(lerp(initial, i/iters, final))
        v = tuple(vi * beta - (1 - beta) * learning_rate * result.deriv(var)
                  for vi, var in zip(v, inputs))
        next_vals = tuple(val + vi for val, vi in zip(vals, v))
    j = result.value.argmin()
    return (result.value[j],
            {var: val[j] for var, val in zip(inputs, vals)},
            graph, v)


def gradient_forward(prog, inputs, vals):
    return prog.eval({var: Dual(val, {var: 1})
                      for var, val in zip(inputs, vals)})


def lerp(start, dist, end):
    return start + dist * (end - start)


# Forward-mode automatic differentiation.
# Not yet complete.
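
# A worked example of what Dual is for (just an illustration; nothing
# below depends on it):
#
#     x = some.x
#     d = Dual(3.0, {x: 1.0})   # the value of x at the point x = 3
#     e = d*d + 2*d             # i.e. x**2 + 2*x carried forward
#     e.value                   # 15.0
#     e.deriv(x)                # 8.0, which is d/dx(x**2 + 2*x) at x = 3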

class Dual(collections.namedtuple('Dual', ['value', 'derivs'])):
    "A dual value: the value at a point and its first partial derivatives."

    def deriv(self, var):
        return self.derivs.get(var, 0)

    def depend_on(self, other):
        "The set of variables either self or other depends on.  Badly named."
        return set(self.derivs) | set(other.derivs)

    def __add__(self, other):
        other = as_dual(other)
        return Dual(self.value + other.value,
                    {var: self.deriv(var) + other.deriv(var)
                     for var in self.depend_on(other)})

    def __radd__(self, other):
        return self + other

    def __sub__(self, other):
        return self + -other

    def __rsub__(self, other):
        return other + -self

    def __neg__(self):
        return -1 * self

    def __mul__(self, other):
        other = as_dual(other)
        return Dual(self.value * other.value,
                    {var: (self.value * other.deriv(var) +
                           other.value * self.deriv(var))
                     for var in self.depend_on(other)})

    def __rmul__(self, other):
        return self * other

    # XXX missing: division, exponentiation.  I can implement division
    # in terms of exponentiation, but I'm not totally clear how to
    # manage exponentiation itself.  I suppose a^b = e^{b ln a}, but
    # that definition only works (in the reals) for positive a.  But I
    # guess that sort of makes sense, since taking negative numbers to
    # fractional powers gives you complex numbers anyway?  But I still
    # want to be able to square negative numbers, or take their
    # reciprocal, without complex numbers popping up!

    def __abs__(self):
        # Using numpy here so that we don’t get NaN for
        # abs(Dual(0, {})), and moreover for arrays including 0.
        zeros = (self.value == 0)
        denom = abs(numpy.where(zeros, 1, self.value))
        return self * numpy.where(zeros, 0, (self.value / denom))


def as_dual(obj):
    return obj if isinstance(obj, Dual) else Dual(obj, {})


# Dumb test scripts.

def displacement_2d(pa, pb):
    xa, ya = pa
    xb, yb = pb
    return xa - xb, ya - yb


def distsq(pa, pb):
    dx, dy = displacement_2d(pa, pb)
    return dx*dx + dy*dy


def dumb_geometric_example():
    # Now let’s define a geometric problem.  First, four points.
    p1 = x1, y1 = some.x1, some.y1
    p2 = x2, y2 = some.x2, some.y2
    p3 = x3, y3 = some.x3, some.y3
    p4 = x4, y4 = some.x4, some.y4

    # Let’s say these four points define a rectangle, its diagonal is
    # 5 units, and one of its sides is 3 units.  Note that this is
    # underdetermined, since the position and rotation of the
    # rectangle can be anything.  First, the diagonal:
    diag_err = abs(distsq(p1, p3) - 5**2)

    # Then, the side.
    side_err = abs(distsq(p1, p2) - 3**2)

    # To constrain perpendicularity, we can set the dot product of the
    # vectors to zero.
    def dot(disp1, disp2):
        dx1, dy1 = disp1
        dx2, dy2 = disp2
        return dx1*dx2 + dy1*dy2

    rectal_1 = abs(dot(displacement_2d(p1, p2), displacement_2d(p2, p3)))
    rectal_2 = abs(dot(displacement_2d(p2, p3), displacement_2d(p3, p4)))
    rectal_3 = abs(dot(displacement_2d(p3, p4), displacement_2d(p4, p1)))
    rectal_4 = abs(dot(displacement_2d(p4, p1), displacement_2d(p1, p2)))

    total = diag_err + side_err + rectal_1 + rectal_2 + rectal_3 + rectal_4
    return p1, p2, p3, p4, total


def dumb_test_1():
    x = some.x
    f = x**2 + 4*x + 3
    print(f.tsort())
    print(f.eval({x: -2}))
    print(f.graphviz())
    m = randopt(f)
    print(m)
    print(randopt(f + some.y))
    print(randopt(abs(f)))
    print(randopt(x**2 + (some.x - 1)**2))
    print(randopt(-(2**-(some.w**2))))
    print(randopt((3+x)**4))
    print(randopt(x**2 * 3 - 4*x))
    print(randopt((x-1)**2/2 + 1 / (2 + (x+1)**2)))
    print(randoptvec(f))
    print(randoptvec((x-1)**2/2 + 1 / (2 + (x+1)**2)))
    print(randoptvec((x-2)**2 + ((x-4) * (some.y + 3))**2))

    # The example at the beginning of the documentation, except with
    # real solutions instead of complex ones, since randoptvec doesn’t
    # do complex numbers by default.
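    # (Flipping the constant term from +6 to -6 is what makes real
    # roots available: the polynomial below is -6 at z = 0 and +18 at
    # z = 1, so it crosses zero somewhere in between.)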
    z = some.z
    zf = abs(3*z**6 - z**5 + 4*z**4 + 3*z**3 + 6*z**2 + 9*z - 6)
    result = randoptvec(zf, tries=10**6)
    print(result)
    print(len(zf.tsort()))
    last = 0
    for i in range(3):          # used 10000 for earlier testing
        if i > last * 1.2:
            print("iter %d:" % i)
            last = i
        nr = randoptvec(zf, tries=10000)
        if nr[0] < result[0]:
            print("improved to %s at %s" % (nr[0], nr[1]))
            result = nr

    # Let’s try the real thing with complex numbers.
    z2 = some.re + 1j * some.im
    zf2 = abs(3*z2**6 - z2**5 + 4*z2**4 + 3*z2**3 + 6*z2**2 + 9*z2 + 6)
    print(zf2)
    for i in range(3):
        print('found %s at %s' % fastopt(zf2, tries=10**6)[:2])

    # Now let’s try a slightly less stupid optimization algorithm, a
    # sort of hill-climbing thing.  It finds solutions precise to 10
    # decimal places in a million evaluations, while “fastopt” only
    # got 2 decimal places in that time.  This algorithm is still
    # derivative-free.
    print(walk(zf2))

    (x1, y1), _, _, (x4, y4), total = dumb_geometric_example()
    result = err, inputs, graph = walk(total)
    print(len(graph.tsort()))
    print(result)
    # This printed 3.99999974566, 3.9999997827, and 4.00000006745 the
    # last three runs, which is a pretty good sign: with a diagonal of
    # 5 and one side of 3, the other side of the rectangle should be 4
    # (a 3-4-5 right triangle).
    print(distsq((inputs[x1], inputs[y1]), (inputs[x4], inputs[y4]))**0.5)

    # width   iters to apparently always converge
    #  100     110
    #  1k       25
    #  10k      15
    #  100k      8
    #  1M        4
    #
    # Here “apparently always converges” means something like “has
    # less than 1% error in the final distance 75% of the time,” which
    # is not a particularly demanding level of performance but is easy
    # to measure.
    #
    # Using widths as low as 100 is probably not useful:
    #
    # width   ms per iter (from a previous version of the code)
    #    10     2.7
    #    60     2.9
    #   100     3.0
    #   160     3.4
    #  1000     6.0
    # 10000    32.0
    #
    # (Last I measured, 100 now takes 2.0 ms.)
    #
    # That is, you can do two iterations that are each 100 points wide
    # in the time for an iteration of 1000, but you need something
    # like four times as many iterations, at least for this
    # undemanding level of performance.  110 iterations of 100 points
    # are half as much computation as 25 iterations of 1000 points,
    # but they take twice as long in Python.
    #
    # Note that the absolute level of performance here is abysmal: to
    # resolve the constraints on a single rectangle with four points,
    # we need 150 milliseconds, or maybe 100 milliseconds with the new
    # code.  But it seems that this is an algorithm that reaches the
    # level of being useful in practice, even if marginally.
    print(graph.graphviz())
    print(graph.tsort())


def dumb_test_2():
    x = some.x
    y = some.y

    # Let’s evaluate a simple multivariate binomial at a point and
    # compute its gradient there.
    f1 = Dual(2.3, {x: 1})
    f2 = Dual(1.1, {y: 1})
    f3 = f1 * 2 + 3 * f2
    print(f3)

    # Now let’s see about interactions between a variable and itself,
    # at x=1.
    x1 = Dual(1, {x: 1})
    print(x1*x1)
    print(1+x1*x1)
    print(x1*x1*x1 - x1*x1)

    # At x=2.
    x2 = Dual(2, {x: 1})
    print(x2*x2*x2 - x2*x2)
    print(1 - x2)

    # At various points.
    xn = Dual(numpy.linspace(-1, 1, 41), {x: 1})
    print(xn*xn)
    print(abs(xn)*xn)

    # Verify we can supply dual numbers as parameters to Program.eval.
    print((abs(x)*x).eval({x: xn}))

    # And that the hack in __abs__() for numpy didn’t hork evaluation
    # with scalars.
    print((abs(x)*x).eval({x: Dual(0, {x: 1})}))
    print(abs(x).eval({x: Dual(0, {x: 1})}))

    # How about a really simple case of gradient descent?
    print(gdforward(abs(x)))
    # That seems to work okay.  How about slightly harder?
    print(gdforward(abs(x+1)))
    # So far, so good.  How about a harder linear equation?
    print(gdforward(abs(2*x-7)))
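    # (For checking the printed answers by eye: the minimizers of the
    # three cases above are x = 0, x = -1, and x = 3.5, each with an
    # objective value of 0.)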
    # How about the simplest quadratic equation?
    print(gdforward(x*x))
    # Slightly harder?
    print(gdforward((x+3)*(x+3)))
    # In polynomial form?
    print(gdforward(x*x + 6*x + 9))
    # Finding the minimum, without the constant?
    print(gdforward(x*x + 6*x))
    # Now, what if we try to find the roots?  This reliably finds the
    # one at 0 because 0 is one of our starting guesses.
    print(gdforward(abs(x*x + 6*x)))
    # If we make it slightly harder, what happens?  This works too,
    # finding -0.172 or occasionally -5.828.
    print(gdforward(abs(x*x + 6*x + 1)))
    # This does too, finding 1.5414 and -4.5414.
    loss = abs(x*x + 3*x - 7)
    result = gdforward(loss)
    print(result)
    print('=', result[2].eval(result[1]))  # I had a bug!

    # These reduced settings still more or less work:
    print()
    print(gdforward(loss, iters=16))
    print(gdforward(loss, width=1))  # gradient descent without restarts

    # width=128 iters=128    4.223  4.245  4.234
    # width=128 iters=1128   4.606  4.612  4.633
    # 1000 iters took (- 4.612 4.234) = 0.378 seconds, so each
    # iteration at width 128 takes about 400 microseconds.
    # width=256 iters=1128   4.589  4.593  4.666
    # width=256 iters=128    4.207  4.206  4.221
    # 1000 iters took (- 4.207 4.593) = 0.386 seconds, so each
    # iteration at width 256 takes about 400 microseconds still.
    # width=512 iters=1128   4.719  4.689  4.671
    # width=512 iters=128    4.240  4.264  4.233
    # 1000 iters took (- 4.689 4.240) = 0.449 seconds, so at width
    # 512, we start to see a slowdown.
    print(gdforward(loss, width=128, iters=16))
    print()

    # Let’s try the geometric example.  This takes 6 ms per iteration,
    # while the analogous random walk only took 3 ms per iteration,
    # even though in theory we are in an 8-dimensional realm that
    # should give a substantial advantage to gradient descent.
    (x1, y1), _, _, (x4, y4), total = dumb_geometric_example()
    result = err, inputs, graph = gdforward(total, width=128, iters=128)
    print(result)
    print(distsq((inputs[x1], inputs[y1]), (inputs[x4], inputs[y4]))**0.5)


if __name__ == '__main__':
    dumb_test_2()
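

# The module docstring promises reverse-mode automatic differentiation,
# which isn't implemented above.  The sketch below (appended after the
# demo) is my illustration of the backward pass described there, not
# the author's design: one forward pass over a Program records every
# intermediate value, then one backward pass pushes each step's adjoint
# (d final output / d step) to its parents in reverse order.  It only
# knows the local derivatives of Add, Sub, Mul, and Div; Pow and Abs
# would need the same treatment that Dual still needs for them.  The
# name `gradient_reverse` is made up here.

def gradient_reverse(graph, env):
    "Return (value, {var: d value / d var}) by reverse accumulation."
    prog = graph.tsort()
    # Forward pass: evaluate every step, keeping all the intermediates.
    vals = []
    for node in prog.nodes:
        args = tuple(vals[prog.index(parent)] for parent in node.parents())
        vals.append(node.apply(args, env))
    # Backward pass: the final step's derivative with respect to itself
    # is 1; every other adjoint starts at 0 and accumulates contributions
    # from the steps that consume it.
    adjoints = [0.0] * len(prog.nodes)
    adjoints[-1] = 1.0
    for i in reversed(range(len(prog.nodes))):
        node, adj = prog.nodes[i], adjoints[i]
        if isinstance(node, (Add, Sub, Mul, Div)):
            pa, pb = node.parents()
            ia, ib = prog.index(pa), prog.index(pb)
            va, vb = vals[ia], vals[ib]
            if isinstance(node, Add):
                da, db = 1.0, 1.0
            elif isinstance(node, Sub):
                da, db = 1.0, -1.0
            elif isinstance(node, Mul):
                da, db = vb, va
            else:               # Div: d(a/b)/da = 1/b, d(a/b)/db = -a/b**2
                da, db = 1.0 / vb, -va / (vb * vb)
            adjoints[ia] += adj * da
            adjoints[ib] += adj * db
        # Var and Const have no parents; other node types are ignored,
        # so don't trust this sketch on graphs containing Pow or Abs.
    return vals[-1], {var: adjoints[prog.index(var)] for var in prog.inputs()}

# For instance, with u, v = some.u, some.v,
# gradient_reverse(u*v + u, {u: 3.0, v: 5.0}) returns
# (18.0, {Var('u'): 6.0, Var('v'): 3.0}), since d(uv + u)/du = v + 1
# and d(uv + u)/dv = u.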