#!/usr/bin/python
# -*- coding: utf-8 -*-
"""Attempt to numerically solve a statistics problem posed by Nick Johnson.

As he originally posed it:

> So, I have a stats problem similar to the card collector’s paradox.
> Suppose each item is an n-bit bit vector. Each bit has probability
> 0.5 of being set to 1. How many items do I need to collect such
> that between them, there is a 0 in every position of the vector?
>
> Now, how does that change if the probability of each bit being 1
> can vary? In particular, how many items do you need to collect to
> get to probability 0.5?

I proposed that it would be easy to solve with the method of secants,
but that turns out to be surprisingly tricky: the exponentials
involved easily provoke division by zero and loss of precision in the
method of secants.

Consider the case where n = 256 and each position’s per-trial success
probability is 0.25. For fewer than six trials, the probability of
overall success is indistinguishable from 0 once you subtract 0.5
from it, as you would do to get a function whose root you want to
find. But for 44 or more trials, the probability of success is 0.999
or greater. That means that dividing by the slope, as one does,
projects one rather far out, easily into an effectively constant
region.

Moreover, the straightforward expression Πᵢ (1 - (1 - pᵢ)ʳ) (where
now r is the number of trials) misbehaves for negative r, which poses
problems for the method of secants. Consider ∀i: pᵢ = ½ as before and
evaluate at r = -10: each factor is 1 - 1024 = -1023, and thus,
depending on whether we have an odd or even number of bits in the bit
vector, we get a large negative or a large positive answer. Either
way, we get exponentially large numbers and entirely unhelpful
behavior out of the method of secants.

Even if we hack around that, we still have the problem that the
function is a good approximation of the constant 0 for values that
are too small, then a good approximation of the constant 1 for values
that are too large. For ∀i: pᵢ = 0.025 and n = 256, the success
probability is between 0.001 and 0.999 only from m = 144 to m = 492.
This means that to get the method of secants to converge at all, we
need to start by guessing the right answer to within about an order
of magnitude. (But then it finishes converging to machine precision
within about 10 iterations, and to within an error of 1 in about 7
iterations.)

The standard way to use Newton–Raphson iteration is to first use
binary chop to get “close enough” to guarantee convergence, and only
then start the Newton–Raphson algorithm. It seems that, on functions
like this, the method of secants has the same problem; see the
bracket_then_secants sketch at the end of this file.

A different problem, maybe an easier one to approximate numerically,
is to find the *expected* number of items you need to collect to get
a success at every bit position; what is discussed above is the
*median* number. The expected number is what you get if you iterate
through the item-collecting procedure many times and divide the total
number of items you’ve collected across all the iterations by the
number of times you went through the procedure. For some
distributions, like the Cauchy distribution, such a number doesn’t
even exist; that quotient does not tend toward any well-defined
limit. This distribution does have a well-defined expectation (it’s
the geometric distribution in the case n = 1, and even thinner-tailed
for n > 1), but I don’t know what it is, and it’s probably not the
median; the expected_trials sketch just below attempts to approximate
it numerically.

"""
import itertools
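

# A sketch, not part of the original program, of one way to
# approximate the *expected* number of trials discussed at the end of
# the docstring above. It rests on the standard identity
# E[T] = Σ_{m≥0} P(T > m) for a random variable T taking values in
# the nonnegative integers, where here P(T > m) = 1 - Πᵢ (1 - (1-pᵢ)ᵐ)
# is the probability that some position is still without a success
# after m trials. The truncation threshold tail_eps is an arbitrary
# choice of mine; the tail decays geometrically, so cutting it off
# costs only a small error.
def expected_trials(ps, tail_eps=1e-12):
    """Approximate the expected number of trials to succeed at every position.

    As a sanity check, expected_trials([0.5]) returns ≈2.0, the mean
    1/p of the geometric distribution mentioned in the docstring.
    """
    total = 0.0
    m = 0
    while True:
        everywhere = 1.0           # probability of a success at every position by trial m
        for pi in ps:
            everywhere *= 1 - (1 - pi)**m
        survival = 1 - everywhere  # P(T > m)
        total += survival
        if survival < tail_eps:
            return total
        m += 1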


def p(ps, m):
    """Calculate the universal success probability for m trials with probabilities ps.

    That is, we have some vector of success probabilities ps for n
    different Bernoulli processes, and we go through m trials. What
    is the probability that we have at least one success in every
    process?
    """
    if m < 0:
        # This is a bogus case, so let’s return something continuous
        # with the right sign and a reasonable derivative to appease
        # the method of secants.
        return -(-m)**0.5
    success = 1.0
    for pi in ps:                  # renamed from p so it doesn’t shadow this function
        success *= 1 - (1 - pi)**m
    return success


def sec_seq(f, x0, x1):
    "The method of secants for finding a zero of a continuous function."
    y = f(x0), f(x1)
    # Stop when the secant is flat: either we have converged (both
    # values equal, typically 0) or we have landed in one of the
    # effectively constant regions described in the module docstring.
    while y[1] != y[0]:
        yield x1
        # Slide to the point where the secant line crosses zero,
        # keeping the previous point around for the next slope.
        x0, x1 = x1, x1 - y[1]*(x1 - x0)/(y[1] - y[0])
        y = y[1], f(x1)


if __name__ == '__main__':
    f = lambda x: p([0.025 for i in range(256)], x) - 0.5
    for x in itertools.islice(sec_seq(f, 50, 500), 0, 20):
        print(x)
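

# A hedged sketch, not part of the original program: do the “binary
# chop first” step that the module docstring describes for
# Newton–Raphson, then hand the narrowed bracket off to sec_seq. It
# assumes f is increasing with f(lo) < 0, which holds for the
# success-probability function above; the doubling schedule and the
# bracket width of 4 are arbitrary choices of mine, not tuned values.
def bracket_then_secants(f, lo=1.0, hi=2.0, width=4.0):
    while f(hi) < 0:               # grow hi geometrically until we bracket the root
        lo, hi = hi, hi * 2
    while hi - lo > width:         # binary chop the bracket down to a narrow one
        mid = (lo + hi) / 2.0
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return sec_seq(f, lo, hi)      # starting points close enough for the secants

if __name__ == '__main__':
    # The same example as above, but without having to guess 50..500.
    g = lambda x: p([0.025 for i in range(256)], x) - 0.5
    for x in itertools.islice(bracket_then_secants(g), 0, 20):
        print(x)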