The Universe of Discourse
           
Fri, 10 Oct 2008

Representing ordinal numbers in the computer and elsewhere
Lately I have been reading Andreas Abel's paper "A semantic analysis of structural recursion", because it was referred to by David Turner's paper on total functional programming.

The Turner paper is a must-read. It's about functional programming in languages where every program is guaranteed to terminate. This is more useful than it sounds at first.

Turner's initial point is that the presence of ⊥ values in languages like Haskell spoils one's ability to reason from the program specification. His basic example is simple:

        loop :: Integer -> Integer
        loop x = 1 + loop x
Taking the function definition as an equation, we subtract (loop x) from both sides and get
0 = 1
which is wrong. The problem is that while subtracting (loop x) from both sides is valid reasoning over the integers, it's not valid over the Haskell Integer type, because Integer contains a ⊥ value for which that law doesn't hold: 1 ≠ 0, but 1 + ⊥ = 0 + ⊥.

Before you can use reasoning as simple and as familiar as subtracting an expression from both sides, you first have to prove that the value of the expression you're subtracting is not ⊥.

By banishing nonterminating functions, one also banishes ⊥ values, and familiar mathematical reasoning is rescued.

You also avoid a lot of confusing language design issues. The whole question of strictness vanishes, because strictness is solely a matter of what a function does when its argument is ⊥, and now there is no ⊥. Lazy evaluation and strict evaluation come to the same thing. You don't have to wonder whether the logical-or operator is strict in its first argument, or its second argument, or both, or neither, because it comes to the same thing regardless.

The drawback, of course, is that if you do this, your language is no longer Turing-complete. But that turns out to be less of a problem in practice than one would expect.

The paper was so interesting that I am following up several of its precursor papers, including Abel's paper, about which the Turner paper says "The problem of writing a decision procedure to recognise structural recursion in a typed lambda calculus with case-expressions and recursive, sum and product types is solved in the thesis of Andreas Abel." And indeed it is.

But none of that is what I was planning to discuss. Rather, Abel introduces a representation for ordinal numbers that I hadn't thought much about before.

I will work up to the ordinals via an intermediate example. Abel introduces a type Nat of natural numbers:

Nat = 1 ⊕ Nat
The "1" here is not the number 1, but rather a base type that contains only one element, like Haskell's () type or ML's unit type. For concreteness, I'll write the single value of this type as '•'.

The ⊕ operator is the disjoint sum operator for types. The elements of the type S⊕T have one of two forms. They are either left(s) where s∈S or right(t) where t∈T. So 1⊕1 is a type with exactly two values: left(•) and right(•).

The values of Nat are therefore left(•), and right(n) for any element n of Nat. So left(•), right(left(•)), right(right(left(•))), and so on. One can get a more familiar notation by defining:

0 = left(•)
Succ(n) = right(n)
And then one just considers 3 to be an abbreviation for Succ(Succ(Succ(0))) as usual. (In this explanation, I omitted some technical details about recursive types.)
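For concreteness, here is one way the Nat type might be transcribed into Haskell. This is only a sketch, not Abel's own notation: the names Fold, zero, suc, natToInteger, and three are my own inventions, and the newtype wrapper is there because Haskell type synonyms cannot be recursive.

```haskell
-- Nat = 1 ⊕ Nat, with Either playing the role of ⊕ and () the role of 1.
-- Fold is a hypothetical constructor name; it only ties the recursive knot.
newtype Nat = Fold (Either () Nat)

-- 0 = left(•)
zero :: Nat
zero = Fold (Left ())

-- Succ(n) = right(n)
suc :: Nat -> Nat
suc n = Fold (Right n)

-- Count the right(...) wrappers to recover an ordinary number.
natToInteger :: Nat -> Integer
natToInteger (Fold (Left ()))  = 0
natToInteger (Fold (Right n)) = 1 + natToInteger n

-- 3 as an abbreviation for Succ(Succ(Succ(0))).
three :: Nat
three = suc (suc (suc zero))
```

Loaded into GHCi, natToInteger three evaluates to 3.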

So much for the natural numbers. Abel then defines a type of ordinal numbers, as:

Ord = (1 ⊕ Ord) ⊕ (Nat → Ord)
In this scheme, an ordinal is either left(left(•)), which represents 0, or left(right(n)), which represents the successor of the ordinal n, or right(f), which represents the limit ordinal of the range of the function f, whose type is Nat → Ord.

We can define abbreviations:

Zero = left(left(•))
Succ(n) = left(right(n))
Lim(f) = right(f)
So 0 = Zero, 1 = Succ(0), 2 = Succ(1), and so on. If we define a function id which maps Nat into Ord in the obvious way:

        id :: Nat → Ord
        id 0       = Zero
        id (n + 1) = Succ(id n)
then ω = Lim(id). Then we easily get ω+1 = Succ(ω), etc., and the limit of this function is 2ω:

        plusomega :: Nat → Ord
        plusomega 0       = Lim(id)
        plusomega (n + 1) = Succ(plusomega n)
We can define an addition function on ordinals:

        + :: Ord → Ord → Ord
        ord + Zero    = ord
        ord + Succ(n) = Succ(ord + n)
        ord + Lim(f)  = Lim(λx. ord + f(x))
This gets us another way to make 2ω: 2ω = Lim(λx. ω + id(x)).

Then this function multiplies a Nat by ω:

        timesomega :: Nat → Ord
        timesomega 0       = Zero
        timesomega (n + 1) = ω + (timesomega n)
and Lim(timesomega) is ω². We can go on like this.
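The definitions above can be transcribed into Haskell fairly directly. This is a sketch under my own assumptions, not Abel's code: I use an ordinary data declaration rather than nested ⊕ sums, I call the type Ordinal because the name Ord is already taken by a standard type class, and fromNat plays the role of id. The shape function is a hypothetical helper, added only so that there is something to look at, since the functions inside Lim can't be printed.

```haskell
-- Natural numbers, as an ordinary data type.
data Nat = Z | S Nat

-- An ordinal is zero, a successor, or the limit of a Nat-indexed sequence.
data Ordinal = Zero | Succ Ordinal | Lim (Nat -> Ordinal)

-- Embed the naturals into the ordinals (the text's id).
fromNat :: Nat -> Ordinal
fromNat Z     = Zero
fromNat (S n) = Succ (fromNat n)

-- ω = Lim(id)
omega :: Ordinal
omega = Lim fromNat

-- Ordinal addition, by recursion on the right-hand argument.
add :: Ordinal -> Ordinal -> Ordinal
add a Zero     = a
add a (Succ b) = Succ (add a b)
add a (Lim f)  = Lim (\x -> add a (f x))

-- Multiply a Nat by ω, following timesomega in the text.
timesOmega :: Nat -> Ordinal
timesOmega Z     = Zero
timesOmega (S n) = add omega (timesOmega n)

-- The limit of timesOmega.
omegaSquared :: Ordinal
omegaSquared = Lim timesOmega

-- Hypothetical helper: count the outermost Succ wrappers and
-- report which constructor lies underneath them.
shape :: Ordinal -> (Integer, String)
shape Zero     = (0, "Zero")
shape (Lim _)  = (0, "Lim")
shape (Succ o) = let (k, base) = shape o in (k + 1, base)
```

For example, shape (add omega (fromNat (S (S Z)))) is (2, "Lim"): two successors stacked on a limit, which is how ω+2 ought to look.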

But here's what puzzled me. The ordinals are really, really big. Much too big to be a set in most set theories. And even the countable ordinals are really, really big. We often think we have a handle on uncountable sets, because our canonical example is the real numbers, and real numbers are just decimal numbers, which seem simple enough. But the set of countable ordinals is full of weird monsters, enough to convince me that uncountable sets are much harder than most people suppose.

So when I saw that Abel wanted to define an arbitrary ordinal as a limit of a countable sequence of ordinals, I was puzzled. Can you really get every ordinal as the limit of a countable sequence of ordinals? What about Ω, the first uncountable ordinal?

Well, maybe. I can't think of any reason why not. But it still doesn't seem right. It is a very weird sequence, and one that you cannot write down. Because suppose you had a notation for all the ordinals that you would need. But because it is a notation, the set of things it can denote is countable, and so a fortiori the limit of all the ordinals that it can denote is a countable ordinal, not Ω.

And it's all very well to say that the sequence starts out (0, ω, 2ω, ω², ω^ω, ε₀, ε₁, ε_{ε₀}, ...), or whatever, but the beginning of the sequence is totally unimportant; what is important is the end, and we have no way to write the end or to even comprehend what it looks like.

So my question to set theory experts: is every limit ordinal the least upper bound of some countable sequence of ordinals?

I hate uncountable sets, and I have a fantasy that in the mathematics of the 23rd Century, uncountable sets will be looked back upon as a philosophical confusion of earlier times, like Zeno's paradox, or the luminiferous aether.


[Other articles in category /math] permanent link

Tue, 30 Sep 2008

The Lake Wobegon Distribution
Michael Lugo mentioned a while back that most distributions are normal. He does not, of course, believe any such silly thing, so please do not rush to correct him (or me). But the remark reminded me of how many people do seem to believe that most distributions are normal. More than once on internet mailing lists I have encountered people who ridiculed others for asserting that "nearly all x are above [or below] average". This is a recurring joke on Prairie Home Companion, broadcast from the fictional town of Lake Wobegon, where "all the women are strong, all the men are good looking, and all the children are above average." And indeed, they can't all be above average. But they could nearly all be above average. And this is actually an extremely common situation.

To take my favorite example: nearly everyone has an above-average number of legs. I wish I could remember who first brought this to my attention. James Kushner, perhaps?

But the world abounds with less droll examples. Consider a typical corporation. Probably most of the employees make a below-average salary. Or, more concretely, consider a small company with ten employees. Nine of them are paid $40,000 each, and one is the owner, who is paid $400,000. The average salary is $76,000, and 90% of the employees' salaries are below average.
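The arithmetic in the ten-employee example is easy to check mechanically. Here is a small sketch; the names salaries, mean, and fractionBelowMean are mine, chosen just for this illustration.

```haskell
-- Nine employees at $40,000 and one owner at $400,000.
salaries :: [Double]
salaries = replicate 9 40000 ++ [400000]

-- Arithmetic mean of a nonempty list.
mean :: [Double] -> Double
mean xs = sum xs / fromIntegral (length xs)

-- Fraction of the entries that lie strictly below the mean.
fractionBelowMean :: [Double] -> Double
fractionBelowMean xs =
  fromIntegral (length (filter (< mean xs) xs)) / fromIntegral (length xs)
```

Here mean salaries is 76000.0 and fractionBelowMean salaries is 0.9, matching the figures in the text.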

The situation is familiar to people interested in baseball statistics because, for example, most baseball players are below average. Using Sean Lahman's database, I find that 588 players received at least one at-bat in the 2006 National League. These 588 players collected a total of 23,501 hits in 88,844 at-bats, for a collective batting average of .265. Of these 588, only 182 had an individual batting average higher than .265. 69% of the baseball players in the 2006 National League were below-average hitters. If you throw out the players with fewer than 10 at-bats, you are left with 432 players of whom 279, or 65%, hit worse than their collective average of 23430/88325 = .265. Other statistics, such as earned-run averages, are similarly skewed.

The reason for this is not hard to see. Baseball-hitting talent in the general population is normally distributed, like this:

Here the right side of the graph represents the unusually good hitters, of whom there aren't very many. The left side of the graph represents the unusually bad hitters; there aren't many of those either. Most people are somewhere in the middle, near the average, and there are about as many above-average hitters as below-average hitters in the general population.

But major-league baseball players are not the general population. They are carefully selected, among the best of the best. They are all chosen from the right-hand edge of the normal curve. The people in the middle of the normal curve, people like me, play baseball in Clark Park, not in Yankee Stadium.

Here's the right-hand corner of the curve above, highly magnified:

As you can see here, the shape is not at all like the curve for the general population, which had the vast majority of the population in the middle, around the average. Here, the vast majority of the population is way over on the left side, just barely good enough to play in the majors, hanging on to their jobs by the skin of their teeth, subject at any moment to replacement by some kid up from the triple-A minors. The above-average players are the ones over on the right end, the few of the few.

Actually I didn't present the case strongly enough. There are around 800 regular major-league ballplayers in the USA, drawn from a population of around 300 million, a ratio of one per 375,000. Well, no, the ratio is smaller, since the U.S. leagues also draw the best players from Mexico, Venezuela, Canada, the Dominican Republic, Japan, and elsewhere. The curve above is much too inclusive. The real curve for major-league ballplayers looks more like this:

(Note especially the numbers on the y-axis.)

This has important implications for the analysis of baseball. A player who is "merely" above average is a rare and precious resource, to be cherished; far more players are below average. Skilled analysts know that comparisons with the "average" player are misleading, because baseball is full of useful, effective players who are below average. Instead, analysts compare players to a hypothetical "replacement level", which is effectively the leftmost edge of the curve, the level at which a player can be easily replaced by one of those kids from triple-A ball.

In the Historical Baseball Abstract, Bill James describes some great team, I think one of the Cincinnati Big Red Machine teams of the mid-1970s, as "possibly the only team in history that was above average at every position". That's an important thing to know about the sport, and about team sports in general: you don't need great players to completely clobber the opposition; it suffices to have players that are merely above average. But if you're the coach, you'd better learn to make do with a bunch of players who are below average, because that's what you have, and that's what the other team will beat you with.

The right-skewedness of the right side of a normal distribution has implications that are important outside of baseball. Stephen Jay Gould wrote an essay about how he was diagnosed with cancer and given six months to live. This sounds awful, and it is awful. But six months was the expected lifetime for patients with his type of cancer—the average remaining lifetime, in other words—and in fact, nearly everyone with that sort of cancer lived less than six months, usually much less. The average was only skewed up as high as six months because of a few people who took years to die. Gould realized this, and then set about trying to find out how the few long-lived outliers survived and what he could do to turn himself into one of the long-lived freaks. And he succeeded, and lived for twenty years, dying eventually at age 60.

My heavens, I just realized that what I've written is an article about the "long tail". I had no idea I was being so trendy. Sorry, everyone.



Thu, 25 Sep 2008

Sprague-Grundy theory
I'm on a small mailing list for math geeks, and there's this one guy there, Richard Penn, who knows everything. Whenever I come up with some idle speculation, he has the answer. For example, back in 2003 I asked:

Let N be any positive integer. Does there necessarily exist a positive integer k such that the base-10 representation of kN contains only the digits 0 through 4?
M. Penn was right there with the answer.

Yesterday, M. Penn asked a question to which I happened to know the answer, and I was so pleased that I wrote up the whole theory in appalling detail. Since I haven't posted a math article in a while, and since the mailing list only has about twelve people on it, I thought I would squeeze a little more value out of it by posting it here.

Richard Penn asked:

N dots are placed in a circle. Players alternate moves, where a move consists of crossing out any one of the remaining dots, and the dots on each side of it (if they remain). The winner is the player who crosses out the last dot. What is the optimal strategy with 19 dots? with 20? Can you generalize?
M. Penn observed that there is a simple strategy for the 20-dot circle, but was not able to find one for the 19-dot circle. But solving such problems in general is made easy by the Sprague-Grundy theory, which I will explain in detail.

0. Short Spoilers

Both positions are wins for the second player to move.

The 20-dot case is trivial, since any first-player move leaves a row of 17 dots, from which the second player can leave two disconnected rows of 7 dots each. Then any first-player move in one of these rows can be effectively answered by the second player in the other row.

The 19-dot case is harder. The first player's move leaves a row of 16 dots. The second player can win by removing 3 dots to leave disconnected rows of 6 and 7 dots. After this, the strategy is complicated, but is easily found by the Sprague-Grundy theory. It's at the end of this article if you want to skip ahead.

Sprague-Grundy theory is a complete theory of all finite impartial games, which are games like this one where the two players have exactly the same moves from every position.

The theory says:

  1. Every such game position has a "value", which is a non-negative integer.
  2. A position is a second-player win if and only if its value is zero.
  3. The value of a position can be calculated from the values of the positions to which the players can move, in a simple way.
  4. The value of a collection of disjoint positions (such as two disconnected rows of dots) can be calculated from the values of its component positions in a simple way.
Long details follow. They are also found in "Winning Ways", Vol I, by Berlekamp, Conway, and Guy.

1. Nim

In the game of Nim, one has some piles of beans, and a legal move is to remove some or all of the beans from any one pile. The winner is the player who takes the last bean. Equivalently, the winner is the last player who has a legal move.

Nim is important because every position in every impartial game is somehow equivalent to a position in Nim, as we will see. In fact, every position in every impartial game is equivalent to a Nim position with at most one heap of beans! Since single Nim-heaps are trivially analyzed, one can completely analyze any impartial game position by calculating the Nim-heap to which it is equivalent.

2. Disjoint sums of games

Definition: The "disjoint sum" A # B of two games A and B is a new game whose rules are as follows: a legal move in A # B is either a move in A or a move in B; the winner is the last player with a legal move.

Three easy exercises:

  1. # is commutative.
  2. # is associative.
  3. Let (a,b,c...) represent the Nim position with heaps a, b, c, etc. Then the game (a,b,c,...) is precisely (a) # (b) # (c) # ... .
Consider the trivial game with no legal moves for anyone. This game is called 0, because:

0 # a = a # 0 = a
for all games a. 0 is a win for the previous player: the next player to move has no legal moves, and loses.

We will call the next player to move "P1", and the player who just moved "P2".

Note that a Nim-heap of 0 beans is precisely the 0 game.

3. Sums of Nim-heaps

We usually represent a single Nim-heap with n beans as "∗n". I'll do that from now on.

We observed that ∗0 is a win for the second player. Observe now that when n is positive, ∗n is a win for the first player, by a trivial strategy.

From now on we will use the symbol "=" to mean a weaker relation on games than strict equality. Two games A and B will be equivalent if their outcomes are the same in a rather strong sense:

A = B means that for any game X, A # X is a winning position if and only if B # X is also.
Taking X = 0, the condition A = B implies that both games have the same outcome in isolation: if one is a first-player win, so is the other. But the condition is stronger than that. Both ∗1 and ∗2 are first-player wins, but ∗1 ≠ ∗2, because ∗1 # ∗1 is a second-player win, while ∗2 # ∗1 is a first-player win.

Exercise: ∗x = ∗y if and only if x = y.

It so happens that the disjoint sum of two Nim-heaps is equivalent to a single Nim-heap:

Nim-sum theorem: ∗a # ∗b = ∗(a ⊕ b), where ⊕ is the bitwise exclusive-or operation.

I'll omit the proof, which is pretty easy to find. ⊕ is often described as "write a and b in binary, and add, ignoring all carries." For example 1 ⊕ 2 = 3, and 13 ⊕ 7 = 10. This implies that ∗1 # ∗2 = ∗3, and that ∗13 # ∗7 = ∗10.
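In Haskell, the ⊕ operation is xor from Data.Bits, so the Nim-sum theorem turns into a one-line fold. This is a minimal sketch; the names nimSum and secondPlayerWins are my own.

```haskell
import Data.Bits (xor)

-- The Nim-value of a collection of heaps: the ⊕ (bitwise xor) of their sizes.
nimSum :: [Int] -> Int
nimSum = foldr xor 0

-- A single heap ∗n is a second-player win only when n is 0,
-- so a collection of heaps is a second-player win when its Nim-sum is 0.
secondPlayerWins :: [Int] -> Bool
secondPlayerWins heaps = nimSum heaps == 0
```

With these definitions nimSum [1, 2] is 3 and nimSum [13, 7] is 10, as in the examples above.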

Although I omitted the proof that # for Nim-heaps is essentially the ⊕ operation in disguise, there are many natural implications of this that you can use to verify that the claim is plausible. For example:

  1. The Nim-sum theorem implies that ∗0 is a neutral element for #, which we already knew.
  2. Since a ⊕ a = 0, we have:
    ∗a # ∗a = ∗0 for all a
    That is, ∗a # ∗a is a win for P2. And indeed, P2 has an obvious strategy: whatever P1 does in one pile, P2 does in the other pile. P2 never runs out of legal moves until after P1 does, and so must win.

  3. Since a ⊕ a = 0, we have, more generally:
    ∗a # ∗a # X = X for all a, X
    No matter what X is, its outcome is the same as that of ∗a # ∗a # X. Why?

    Suppose you are the player with a winning strategy for playing X alone. Then it is easy to see that you have a winning strategy in ∗a # ∗a # X, as follows: ignore the ∗a # ∗a component, until your opponent moves in it, when you should copy their move in the other half of that component. Eventually the ∗a # ∗a part will be used up (that is, reduced to ∗0 # ∗0 = 0) and your opponent will be forced to move in X, whereupon you can continue your winning strategy there until you win.

  4. According to the ⊕ operation, ∗1 # ∗2 = ∗3, and so ∗1 # ∗2 # ∗3 = ∗3 # ∗3 = 0, so P2 should have a winning strategy in ∗1 # ∗2 # ∗3. Which he does: If P1 removes any entire heap, P2 can win by equalizing the remaining heaps, leaving ∗1 # ∗1 = 0 or ∗2 # ∗2 = 0, which he wins easily. If P1 equalizes any two heaps, P2 can remove the third heap, winning the same way.

  5. Let's reconsider the game of the previous paragraph, but change the ∗1 to something else. If x ≠ 1, then 2 ⊕ 3 ⊕ x > 0, so ∗2 # ∗3 # ∗x = ∗y, where y > 0. Since ∗y is a single nonempty Nim-heap, it is obviously a win for P1, and so ∗2 # ∗3 # ∗x should be equivalent, also a win for P1. What is P1's winning strategy in ∗2 # ∗3 # ∗x? It's easy. If x > 1, then P1 can reduce ∗x to ∗1, leaving ∗2 # ∗3 # ∗1, which we saw is a win for the player who has just moved. And if x = 0, then P1 can move to ∗2 # ∗2 and win.
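The search for P1's move in items 4 and 5 can be mechanized. The sketch below relies on a standard fact about Nim that I haven't proved here: if s is the ⊕ of all the heap sizes, then a heap of size h can be reduced to reach a zero position exactly when h ⊕ s < h, and the move is to reduce that heap to h ⊕ s. The name winningMoves is my own.

```haskell
import Data.Bits (xor)

-- All positions with Nim-sum 0 that can be reached in one move.
-- Each heap h is replaced by h ⊕ total, provided that is a reduction.
winningMoves :: [Int] -> [[Int]]
winningMoves heaps =
  [ before ++ [h'] ++ after
  | (before, h : after) <- map (`splitAt` heaps) [0 .. length heaps - 1]
  , let h' = h `xor` total
  , h' < h
  ]
  where
    total = foldr xor 0 heaps
```

For instance, winningMoves [2, 3, 5] is [[2, 3, 1]] and winningMoves [2, 3, 0] is [[2, 2, 0]], which are the moves described in item 5, while winningMoves [1, 2, 3] is empty, as item 4 predicts.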

4. The MEX rule

The important thing about disjoint sums is that they abstract away the strategy. If you have some complicated set of Nim-heaps ∗a # ∗b # ... # ∗z, you can ignore them and pretend instead that they are a single heap ∗(a ⊕ b ⊕ ... ⊕ z). Your best move in the compound heap can be easily worked out from the corresponding best move in the fictitious single heap.

For example, how do you figure out how to play in ∗2 # ∗3 # ∗x? You consider it as (∗2 # ∗3) # ∗x = ∗1 # ∗x. That is, you pretend that the ∗2 and the ∗3 are actually a single heap of size 1. Then your strategy is to win in ∗1 # ∗x, which you obviously do by reducing ∗x to size 1, or, if ∗x is already ∗0, by changing ∗1 to ∗0.

Now, that is very facile, but ∗2 # ∗3 is not the same game as ∗1, because from ∗1 there is just one legal move, which is to ∗0. Whereas from ∗2 # ∗3 there are several moves. It might seem that your opponent could complicate the situation, say by moving from ∗2 # ∗3 to ∗3, which she could not do if it were really ∗1.

But actually this extra option can't possibly help your opponent, because you have an easy response to that move, which is to move right back to ∗1! If pretending that ∗2 # ∗3 was ∗1 was good before, it is certainly good after you make it ∗1 for real.

From ∗2 # ∗3 there are a whole bunch of moves:

Move to ∗3
Move to ∗2
Move to ∗1 # ∗3 = ∗2
Move to ∗2 # ∗1 = ∗3
Move to ∗2 # ∗2 = ∗0
But you can disregard the first four of these, because they are reversible: if some player X has a winning strategy that works by pretending that ∗2 # ∗3 is identical with ∗1, then the extra options of moving to ∗2 and ∗3 won't help X's opponent, because X can reverse those moves and turn the ∗2 # ∗3 component back into ∗1. So we can ignore these options, and say that there's just one move from ∗2 # ∗3 worth considering further, namely to ∗2 # ∗2 = 0. Since this is exactly the same set of moves that is available from ∗1, ∗2 # ∗3 behaves just like ∗1 in all situations, and we have just proved that ∗2 # ∗3 = ∗1.

Unlike the other moves, the move from ∗2 # ∗3 to ∗0 is not reversible. Once someone turns ∗2 # ∗3 into ∗0, by equalizing the piles, it cannot then be turned back into ∗1, or anything else.

Considering this in more generality, suppose we have some game position P where the options are to move to one of several possible Nim-heaps, and M is the smallest Nim-heap that is not among the options. Then P = ∗M. Why? Because P has just the same options that ∗M has, namely the options of moving to one of ∗0 ... ∗(M-1). P also has some extra options, but we can ignore these because they're reversible. If you have a winning strategy in X # ∗M, then you have a winning strategy in X # P also, as follows:

  • If your opponent plays in X, then follow your strategy for X # ∗M, since the same move will also be available in X # P.

  • If your opponent makes P into ∗y, with y < M, then they've discarded their extra options, which are now irrelevant; play as you would if they had moved from X # ∗M to X # ∗y.

  • If your opponent makes P into ∗y, with y > M, then just move from ∗y to ∗M, leaving X # ∗M, which you can win.

MEX Theorem: If all the legal moves from a position P are equivalent to Nim-heaps of sizes {s1, ..., sk}, then P itself is equivalent to a nim-heap of size MEX(s1, ..., sk), where the MEX is the "Minimal EXcluded" element of the set: the smallest nonnegative integer that is not in the set.
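The MEX itself is simple to compute for a finite list of Nim-values. Here is a minimal sketch; the name mex and the helper go are mine, and the function will not terminate if it is handed an infinite list that contains every nonnegative integer it tries.

```haskell
-- The smallest nonnegative integer that does not appear in the list.
-- Intended for finite lists only.
mex :: [Int] -> Int
mex xs = go 0
  where
    go n
      | n `elem` xs = go (n + 1)
      | otherwise   = n
```

For example, the five moves from ∗2 # ∗3 listed earlier have values 3, 2, 2, 3, and 0, and mex [3, 2, 2, 3, 0] is 1, in agreement with ∗2 # ∗3 = ∗1.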

For example, let's consider what happens if we augment Nim by adding a special token, called ♦. A player may, in lieu of a regular move, replace ♦ by a pile of beans of any positive size. What effect does this have on Nim?

Since the legal moves from ♦ are {∗1, ∗2, ∗3, ...} and the MEX is 0, ♦ should behave like ∗0. That is, adding a ♦ token to any position should leave the outcome unaffected. And indeed it does. If you have a winning strategy in game G, then you have a winning strategy in G # ♦ also, as follows: If your opponent plays in G, reply in G. If your opponent replaces ♦ with a pile of beans, remove it, leaving only G.

Exercise: Let G be a game where all the legal moves are to Nim-heaps. Then G is a win for P1 if and only if one of the legal moves from G is to ∗0, and a win for P2 if and only if none of the legal moves from G is to ∗0.

5. The Sprague-Grundy theory

An "impartial game" is one where both players have the same moves from every position.

Sprague-Grundy theorem: Any finite impartial game is equivalent to some Nim-heap ∗n, which is the "Nim-value" of the game.

Now let's consider Richard Penn's game, which is impartial. A legal move is to cross out any dot, and the adjacent dot or dots, if any.

The Sprague-Grundy theorem says that every row of dots in Penn's game is equivalent to some Nim-heap. Let's tabulate the size of this heap (the Nim-value) for each row of n dots. We'll represent a row of n dots as [οοοοο...ο]. Obviously, [] = ∗0 so the Nim-value of [] is 0. Also obviously, [ο] = ∗1, since they're exactly the same game.

[οο] = ∗1 also, since the only legal move from [οο] is to [] = 0, and the MEX of {0} is 1.

The legal moves from [οοο] are to [] = ∗0 and [ο] = ∗1, so {∗0, ∗1}, and the MEX is 2. So [οοο] = ∗2.

Let's check that this is working. Since the Nim-value of [οοο] is 2, the theory predicts that [οοο] # ∗2 = 0 and so should be a win for P2. P2 should be able to pretend that [οοο] is actually ∗2.

Suppose P1 turns the ∗2 into ∗1, moving to [οοο] # ∗1. Then P2 should turn [οοο] into ∗1 also, which he can do by crossing out an end dot and the adjacent one, leaving [ο] # ∗1, which he easily wins. If P1 turns ∗2 into ∗0, moving to [οοο] # ∗0, then P2 should turn [οοο] into ∗0 also, which he can do by crossing out the middle and adjacent dots, leaving [] # ∗0, which he wins immediately.

If P1 plays in the [οοο] component, she must move to [] or to [ο], each equivalent to some Nim-heap of size x < 2, and P2 can answer by reducing the true Nim-heap ∗2 to contain x beans also.

Continuing our analysis of rows of dots: In Penn's game, the legal moves from [οοοο] are to [οο] and [ο]. Both of these have Nim-value ∗1, so the MEX is 0.

Easy exercise: Since [οοοο] is supposedly equivalent to ∗0, you should be able to show that a player who has a winning strategy in some game G also has a winning strategy in G # [οοοο].

The legal moves from [οοοοο] are to [οοο], [οο], and [ο] # [ο]. The Nim-values of these three games are ∗2, ∗1, and ∗0 respectively, so the MEX is 3 and [οοοοο] = ∗3.

The legal moves from [οοοοοο] are to [οοοο], [οοο], and [ο] # [οο]. The Nim-values of these three games are 0, 2, and 0, so [οοοοοο] = ∗1.

6. Richard Penn's game analyzed

Row of n dots   Nim-value   Winning move
0               0
1               1           []
2               1           []
3               2           []
4               0
5               3           [ο] # [ο]
6               1           [ο] # [οο]
7               1           [οο] # [οο]
8               0
9               3           [οοο] # [οοο]
10              3           [οοοοοοοο]
11              2           [οοοο] # [οοοο]
12              2           [οο] # [οοοοοοο]
13              4           [οοοοο] # [οοοοο]
14              0
15              5           [οοοοοο] # [οοοοοο]
16              2           [ο × 14]
17              2           [οοοοοοο] # [οοοοοοο]
18              3           [οοο] # [ο × 12]
19              3           [οοοοοοοο] # [οοοοοοοο]
20              0
Continuing in this way, we get the table of Nim-values that you see above.
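The Nim-value column of the table can be regenerated by applying the MEX rule and the Nim-sum theorem mechanically. The following is a sketch under my own naming (mex, moves, nimValues); it represents a position as a list of row lengths and is written for clarity rather than speed.

```haskell
import Data.Bits (xor)

-- Smallest nonnegative integer not in a finite list.
mex :: [Int] -> Int
mex xs = go 0
  where
    go n
      | n `elem` xs = go (n + 1)
      | otherwise   = n

-- Positions reachable in one move from a row of n dots,
-- each given as a list of the lengths of the rows that remain.
moves :: Int -> [[Int]]
moves n = endMoves ++ interiorMoves
  where
    -- Crossing out an end dot removes it and its single neighbor, if any.
    endMoves      = [[max 0 (n - 2)] | n >= 1]
    -- Crossing out an interior dot removes three dots and splits the row.
    interiorMoves = [[a, n - 3 - a] | a <- [0 .. n - 3]]

-- nimValues !! n is the Nim-value of a row of n dots.
-- The list refers to its own earlier entries, which laziness permits.
nimValues :: [Int]
nimValues = map value [0 ..]
  where
    value n = mex [foldr xor 0 (map (nimValues !!) rows) | rows <- moves n]
```

Evaluating take 21 nimValues gives [0,1,1,2,0,3,1,1,0,3,3,2,2,4,0,5,2,2,3,3,0], which agrees with the table.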

The table says that a row of 19 dots should be a win for P1, if she reduces the Nim-value from 3 to 0. And indeed, P1 has an easy winning strategy, which is to cross the 3 dots in the middle of the row, replacing [οοοοοοοοοοοοοοοοοοο] with [οοοοοοοο] # [οοοοοοοο]. But no such easy strategy obtains in a row of 20 dots, which, indeed, is a win for P2.

The original question involved circles of dots, not rows. But from a circle of n dots there is only one legal move, which is to a row of n-3 dots. From a circle of 20 dots, the only legal move is to [ο × 17] = ∗2, which should be a win for P1. P1 should win by changing ∗2 to ∗0, so should look for the move from [ο × 17] to ∗0. This is the obvious solution Richard Penn discovered: move to [οοοοοοο] # [οοοοοοο]. So the circle of 20 dots is an easy win for P2, the second player.

But for the circle of 19 dots the answer is the same, a win for the second player. The first player must move to [ο × 16] = ∗2, and then the second player should win by moving to a 0 position. [ο × 16] must have such a move, because if it didn't, the MEX rule would imply that its Nim-value was 0 instead of 2. So what's the second player's zero move here? There are actually two options. The second player can win by playing to [ο × 14], or by splitting the row into [οοοοοο] # [οοοοοοο].
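Finding the zero moves from a given row is also mechanical: list every move, and keep the ones whose resulting rows have Nim-sum 0. The sketch below repeats the definitions needed to compute Nim-values so that it stands on its own; all of the names (mex, moves, nimValues, positionValue, zeroMoves) are my own, and mirror-image moves are merged by sorting the row lengths.

```haskell
import Data.Bits (xor)
import Data.List (nub, sort)

-- Smallest nonnegative integer not in a finite list.
mex :: [Int] -> Int
mex xs = go 0
  where
    go n
      | n `elem` xs = go (n + 1)
      | otherwise   = n

-- Positions reachable in one move from a row of n dots, as lists of row lengths.
moves :: Int -> [[Int]]
moves n = [[max 0 (n - 2)] | n >= 1] ++ [[a, n - 3 - a] | a <- [0 .. n - 3]]

-- nimValues !! n is the Nim-value of a row of n dots.
nimValues :: [Int]
nimValues = map (\n -> mex (map positionValue (moves n))) [0 ..]

-- Nim-value of a collection of disjoint rows.
positionValue :: [Int] -> Int
positionValue = foldr xor 0 . map (nimValues !!)

-- Moves from a row of n dots that lead to a position of Nim-value 0.
-- Empty rows are dropped and mirror images are identified.
zeroMoves :: Int -> [[Int]]
zeroMoves n =
  nub [sort (filter (> 0) rows) | rows <- moves n, positionValue rows == 0]
```

Here zeroMoves 16 is [[14],[6,7]], the two options just mentioned, and zeroMoves 20 is empty, as it must be for a row whose Nim-value is 0.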


7. Complete strategy for 19-dot circle

Just for completeness, let's follow one of these purportedly winning moves in detail. I claimed that the second player could win by moving to [οοοοοο] # [οοοοοοο]. But what next?

First recall that any isolated row of four dots, [οοοο], can be disregarded, because any first-player move in such a row can be answered by a second-player move that crosses out the rest of the row. And any pair of isolated rows of one or two dots, [ο] or [οο], can be similarly disregarded, because any move that crosses out one can be answered by a move that crosses out the other. So in what follows, positions like [οο] # [ο] # [οοοο] will be assumed to have been won by the second player, and we will say that the second player "has an easy win" if he has a move to such a position.

  • The first player has three possible moves in the left [οοοοοο] component, as follows:

    1. If the first player moves to [οοοο] # [οοοοοοο], the second player has an easy win by moving to [οοοο] # [οοοο].

    2. If the first player moves to [οοο] # [οοοοοοο] = ∗2 # ∗1, the second player should reduce the left component to ∗1, by moving to [ο] # [οοοοοοο]. Then no matter what the first player does, the second player has an easy win.

    3. If the first player moves to [ο] # [οο] # [οοοοοοο] = ∗1 # ∗1 # ∗1, the second player can disregard the [ο] # [οο] component. The second player instead plays to [ο] # [οο] # [οοοο] and wins.

  • The first player has four moves in the right [οοοοοοο] component, as follows:

    1. If the first player moves to [οοοοοο] # [οοοοο] = ∗1 # ∗3, the second player should move from ∗3 to ∗1. There must be a move in [οοοοο] to a position with Nim-value 1. (If there weren't, [οοοοο] would have Nim-value 1 instead of 3, by the MEX rule.) Indeed, the second player can move to [οοοοοο] # [οο]. Now whatever the first player does the second player has an easy win, either to [οοοο] or to X # X for some row X.

    2. If the first player moves to [οοοοοο] # [οοοο] = ∗1 # ∗0, the second player should move from ∗1 to ∗0. There must be a move in [οοοοοο] to a position with Nim-value 0, and indeed there is: the second player moves to [οοοο] # [οοοο] and wins.

    3. If the first player moves to [οοοοοο] # [ο] # [οοο] = ∗1 # ∗1 # ∗2, the second player can disregard the ∗1 # ∗1 component and should move in the ∗2 component, to ∗0, which he does by eliminating it entirely, leaving the first player with [οοοοοο] # [ο]. After any move by the first player the second player has an easy win.

    4. If the first player moves to [οοοοοο] # [οο] # [οο] = ∗1 # ∗1 # ∗1, the second player has a number of good choices. The simplest thing to do is to disregard the [οο] # [οο] component and move in the [οοοοοο] to some position with Nim-value 0. Moving to [οοοο] # [οο] # [οο] suffices.

So [ο × 16] is indeed a win for the next player to move, and a circle of 19 dots is therefore a win for the previous player, who is the second player.

But the important point here is not the strategy itself, which is hard to remember, and which could have been found by computer search. The important thing to notice is that computing the table of Nim-values for each row of n dots is easy, and once you have done this, the rest of the strategy almost takes care of itself. Do you need to find a good move from [οοοοοοο] # [οοοοοοοοο] # [οοοοοοοοοο]? There's no need to worry, because the table says that this can be viewed as ∗1 # ∗3 # ∗3, and so a good move is to reduce the ∗1 component, the [οοοοοοο], to ∗0, say by changing it to [οοοο] or to [οο] # [οο]. Whatever your opponent does next, calculating your reply will be similarly easy.


[Other articles in category /math] permanent link

Wed, 17 Sep 2008

data Mu f = In (f (Mu f))
Last week I wrote about one of two mindboggling pieces of code that appear in the paper Functional Programming with Overloading and Higher-Order Polymorphism, by Mark P. Jones. Today I'll write about the other one. It looks like this:

        data Mu f = In (f (Mu f))                       -- (???)
I bet a bunch of people reading this on Planet Haskell are nodding and saying "Oh, that!"

When I first saw this I couldn't figure out what it was saying at all. It was totally opaque. I still have trouble recognizing in Haskell what tokens are types, what tokens are type constructors, and what tokens are value constructors. Code like (???) is unusually confusing in this regard.

Normally, one sees something like this instead:

        data Maybe f = Nothing | Just f
Here f is a type variable; that is, a variable that ranges over types. Maybe is a type constructor, which is like a function that you can apply to a type to get another type. The most familiar example of a type constructor is List:

        data List e = Nil | Cons e (List e)
Given any type f, you can apply the type constructor List to f to get a new type List f. For example, you can apply List to Int to get the type List Int. (The Haskell built-in list type constructor goes by the funny name of [], but works the same way. The type [Int] is a synonym for ([] Int).)

Actually, type names are type constructors also; they're argumentless type constructors. So we have type constructors like Int, which take no arguments, and type constructors like List, which take one argument. Haskell also has type constructors that take more than one argument. For example, Haskell has a standard type constructor called Either for making union types:

        data Either a b = Left a | Right b;
Then the type Either Int String contains values like Left 37 and Right "Cotton Mather".

To keep track of how many arguments a type constructor has, one can consider the, ahem, type, of the type constructor. But to avoid the obvious looming terminological confusion, the experts use the word "kind" to refer to the type of a type constructor. The kind of List is * → *, which means that it takes a type and gives you back a type. The kind of Either is * → * → *, which means that it takes two types and gives you back a type. Well, actually, it is curried, just like regular functions are, so that Either Int is itself a type constructor of kind * → * which takes a type a and returns a type which could be either an Int or an a. The nullary type constructor Int has kind *.

Continuing the "Maybe" example above, f is a type, or a constructor of kind *, if you prefer. Just is a value constructor, of type f → Maybe f. It takes a value of type f and produces a value of type Maybe f.

Now here is a crucial point. In declarations of type constructors, such as these:

        data Either a b = ...
        data List e = ...
        data Maybe f = ...
the type variables a, b, e, and f actually range over type constructors, not over types. Haskell can infer the kinds of the type constructors Either, List, and Maybe, and also the kinds of the type variables, from the definitions on the right of the = signs. In this case, it concludes that all four variables must have kind *, and so really do represent types, and not higher-order type constructors. So you can't ask for Either Int List because List is known to have kind * → *, and Haskell needs a type constructor of kind * to serve as an argument to Either.

But with a different definition, Haskell might infer that a type variable has a higher-order kind. Here is a contrived example, which might be good for something, perhaps. I'm not sure:

        data TyCon f = ValCon (f Int)
This defines a type constructor TyCon with kind (* → *) → *, which can be applied to any type constructor f that has kind * → *, to yield a type. What new type? The new type TyCon f is isomorphic to the type f of Int. For example, TyCon List is basically the same as List Int. The value Just 37 has type Maybe Int, and the value ValCon (Just 37) has type TyCon Maybe.

Similarly, the value [1, 2, 3] has type [Int], which, you remember, is a synonym for [] Int. And the value ValCon [1, 2, 3] has type TyCon [].
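Here is a small sketch of the TyCon example that can be loaded into GHCi. The names viaMaybe, viaList, and unValCon are made up for this illustration and do not come from the Jones paper.

```haskell
-- TyCon takes a type constructor f of kind * -> * and applies it to Int.
data TyCon f = ValCon (f Int)

-- f = Maybe, so the wrapped value has type Maybe Int.
viaMaybe :: TyCon Maybe
viaMaybe = ValCon (Just 37)

-- f = [], so the wrapped value has type [Int].
viaList :: TyCon []
viaList = ValCon [1, 2, 3]

-- Unwrapping shows that TyCon f carries exactly an (f Int) value.
unValCon :: TyCon f -> f Int
unValCon (ValCon x) = x
```

Asking GHCi for `:kind TyCon` should report a kind of the shape (* → *) → *, although newer versions of GHC may print Type in place of *.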

Now that the jargon is laid out, let's look at (???) again:

        data Mu f = In (f (Mu f))                       -- (???)
When I was first trying to get my head around this, I had trouble seeing what the values were going to be. It looks at first like it has no bottom. The token f here, like in the TyCon example, is a variable that ranges over type constructors with kind * → *, so could be List or Maybe or [], something that takes a type and yields a new type. Mu itself has kind (* → *) → *, taking something like f and yielding a type. But what's an actual value? You need to apply the value constructor In to a value of type f (Mu f), and it's not immediately clear where to get such a thing.

I asked on #haskell, and Cale Gibbard explained it very clearly. To do anything useful you first have to fix f. Let's take f = Maybe. In that particular case, (???) becomes:

        data Mu Maybe = In (Maybe (Mu Maybe))
So the In value constructor will take a value of type Maybe (Mu Maybe) and return a value of type Mu Maybe. Where do we get a value of type Maybe (Mu Maybe)? Oh, no problem: the value Nothing is polymorphic, and has type Maybe a for all a, so in particular it has type Maybe (Mu Maybe). Whatever Maybe (Mu Maybe) is, it is a Maybe-type, so it has a Nothing value. So we do have something to get started with.

Since Nothing is a Maybe (Mu Maybe) value, we can apply the In constructor to it, yielding the value In Nothing, which has type Mu Maybe. Then applying Just, of type a → Maybe a, to In Nothing, of type Mu Maybe, produces Just (In Nothing), of type Maybe (Mu Maybe) again. We can repeat the process as much as we want and produce as many values of type Mu Maybe as we want; they look like these:

        In Nothing
        In (Just (In Nothing))
        In (Just (In (Just (In Nothing))))
        In (Just (In (Just (In (Just (In Nothing))))))
        ...
And that's it, that's the type Mu Maybe, the set of those values. It will look a little simpler if we omit the In markers, which don't really add much value. We can just agree to omit them, or we can get rid of them in the code by defining some syntactic sugar:

        nothing = In Nothing
        just = In . Just
Then the values of Mu Maybe look like this:
        nothing
        just nothing
        just (just nothing)
        just (just (just nothing))
        ...
It becomes evident that what the Mu operator does is to close the type under repeated application. This is analogous to the way the fixpoint combinator works on values. Consider the usual definition of the fixpoint combinator:

        Y f = f (Y f)
Here f is a function of type a → a. Y f is a fixed point of f. That is, it is a value x of type a such that f x = x. (Put x = Y f in the definition to see this.)

The fixed point of a function f can be computed by considering the limit of the following sequence of values:


f(⊥)
f(f(⊥))
f(f(f(⊥)))
...

This actually finds the least fixed point of f, for a certain definition of "least". For many functions f, like x → x + 1, this finds the uninteresting fixed point ⊥, but for many f, like x → λ n. if n = 0 then 1 else n * x(n - 1), it's something better.
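As a hedged illustration, here is the factorial example written with the fix function from Data.Function, which plays the role of Y in Haskell. The name factStep is an invention for this sketch.

```haskell
import Data.Function (fix)

-- One unrolling step: given an approximation `self` to factorial,
-- produce a better approximation. This is the f whose fixed point we want.
factStep :: (Integer -> Integer) -> (Integer -> Integer)
factStep self n = if n == 0 then 1 else n * self (n - 1)

-- The least fixed point of factStep is the factorial function.
factorial :: Integer -> Integer
factorial = fix factStep
```

With this loaded, factorial 5 evaluates to 120. By contrast, fix (+ 1) is the uninteresting case: evaluating it never terminates, which is how ⊥ shows up in practice.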

Mu is analogous to Y. Instead of operating on a function f from values to values, and producing a single fixed-point value, it operates on a type constructor f from types to types, and produces a fixed-point type. The resulting type T is the least fixed point of the type constructor f, the smallest set of values such that f T = T.

Consider the example of f = Maybe again. We want to find a type T such that T = Maybe T. Consider the following sequence:

{ ⊥ }
Maybe { ⊥ }
Maybe(Maybe { ⊥ })
Maybe(Maybe(Maybe { ⊥ }))
...

The first item is the set that contains nothing but the bottom value, which we might call t0. But t0 is not a fixed point of Maybe, because Maybe { ⊥ } also contains Nothing. So Maybe { ⊥ } is a different type from t0, which we can call t1 = { Nothing, ⊥ }.

The type t1 is not a fixed point of Maybe either, because Maybe t1 evidently contains both Nothing and Just Nothing. Repeating this process, we find that the limit of the sequence is the type Mu Maybe = { ⊥, Nothing, Just Nothing, Just (Just Nothing), Just (Just (Just Nothing)), ... }. This type is fixed under Maybe.

It might be worth pointing out that this is not the only such fixed point, but it is the least fixed point. One can easily find larger types that are fixed under Maybe. For example, postulate a special value Q which has the property that Q = Just Q. Then Mu Maybe ∪ { Q } is also a fixed point of Maybe. But it's easy to see (and to show, by induction) that any such fixed point must be a superset of Mu Maybe. Further consideration of this point might take me off to co-induction, paraconsistent logic, Peter Aczel's nonstandard set theory, and I'd never get back again. So let's leave this for now.

So that's what Mu really is: a fixed-point operator for type constructors. And having realized this, one can go back and look at the definition and see that oh, that's precisely what the definition says, how obvious:

              Y f =     f  (Y f)             -- ordinary fixed-point operator
        data Mu f = In (f (Mu f))            -- (???)
Given f, a function from values to values, Y(f) calculates a value x such that x = f(x). Given f, a function from types to types, Mu(f) calculates a type T such that f(T) = T. That's why the definitions are identical. (Except for that annoying In constructor, which really oughtn't to be there.)

You can use this technique to construct various recursive datatypes. For example, Mu Maybe turns out to be equivalent to the following definition of the natural numbers:

        data Number = Zero | Succ Number;
Notice the structural similarity with the definition of Maybe:

        data Maybe a = Nothing | Just a;
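To make the claimed equivalence concrete, here is a minimal sketch of the two conversion functions between Mu Maybe and Number. The names toNumber and fromNumber are assumptions of this sketch; the correspondence ignores ⊥ and infinite values.

```haskell
data Mu f = In (f (Mu f))

data Number = Zero | Succ Number
  deriving (Eq, Show)

-- Nothing plays the role of Zero, and each Just layer is one Succ.
toNumber :: Mu Maybe -> Number
toNumber (In Nothing)  = Zero
toNumber (In (Just m)) = Succ (toNumber m)

-- The inverse direction rebuilds the In/Just layers.
fromNumber :: Number -> Mu Maybe
fromNumber Zero     = In Nothing
fromNumber (Succ n) = In (Just (fromNumber n))
```

For example, toNumber (In (Just (In (Just (In Nothing))))) is Succ (Succ Zero), and toNumber (fromNumber n) gives back n for any finite n.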
One can similarly define lists:

        data Mu f = In (f (Mu f)) 
        data ListX a b = Nil | Cons a b deriving Show
        type List a = Mu (ListX a)

        -- syntactic sugar
        nil :: List a
        nil = In Nil
        cons :: a -> List a -> List a
        cons x y = In (Cons x y)

        -- for example
        ls = cons 3 (cons 4 (cons 5 nil))          -- :: List Integer
        lt = (cons 'p' (cons 'y' (cons 'x' nil)))  -- :: List Char
Or you could similarly do trees, or whatever. Why one might want to do this is a totally separate article, which I am not going to write today.
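One way to convince yourself that the Mu-based List really is a list type is to convert it to the built-in one. The function toBuiltin below is a hypothetical helper written for this sketch, not something from the paper.

```haskell
data Mu f = In (f (Mu f))
data ListX a b = Nil | Cons a b
type List a = Mu (ListX a)

nil :: List a
nil = In Nil

cons :: a -> List a -> List a
cons x y = In (Cons x y)

-- Peel off one In/Cons layer at a time to rebuild an ordinary list.
toBuiltin :: List a -> [a]
toBuiltin (In Nil)         = []
toBuiltin (In (Cons x xs)) = x : toBuiltin xs
```

Applied to the two examples above, toBuiltin (cons 3 (cons 4 (cons 5 nil))) is [3,4,5] and toBuiltin (cons 'p' (cons 'y' (cons 'x' nil))) is "pyx".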

Here's the point of today's article: I find it amazing that Haskell's type system is powerful enough to allow one to define a fixed-point operator for functions over types.

We've come a long way since FORTRAN, that's for sure.

A couple of final, tangential notes: Google search for "Mu f = In (f (Mu f))" turns up relatively few hits, but each hit is extremely interesting. If you're trying to preload your laptop with good stuff to read on a plane ride, downloading these papers might be a good move.

The Peter Aczel thing seems to be less well-known than it should be. It is a version of set theory that allows coinductive definitions of sets instead of inductive definitions. In particular, it allows one to have a set S = { S }, which standard set theory forbids. If you are interested in co-induction you should take a look at this. You can find a clear explanation of it in Barwise and Etchemendy's book The Liar (which I have read) and possibly also in Aczel's book Non Well-Founded Sets (which I haven't read).


[Other articles in category /prog] permanent link

Thu, 11 Sep 2008

Return return
Among the things I read during the past two months was the paper Functional Programming with Overloading and Higher-Order Polymorphism, by Mark P. Jones. I don't remember why I read this, but it sure was interesting. It is an introduction to the new, cool features of Haskell's type system, with many examples. It was written in 1995 when the features were new. They're no longer new, but they are still cool.

There were two different pieces of code in this paper that wowed me. When I started this article, I was planning to write about #2. I decided that I would throw in a couple of paragraphs about #1 first, just to get it out of the way. This article is that couple of paragraphs.

[ Addendum 20080917: Here's the article about #2. ]

Suppose you have a type that represents terms over some type v of variable names. The v type is probably strings but could possibly be something else:

        data Term v = TVar v                -- Type variable
                    | TInt                  -- Integer type
                    | TString               -- String type
                    | Fun (Term v) (Term v) -- Function type
There's a natural way to make the Term type constructor into an instance of Monad:

        instance Monad Term where
            return v       = TVar v
            TVar v   >>= f = f v
            TInt     >>= f = TInt
            TString  >>= f = TString
            Fun d r  >>= f = Fun (d >>= f) (r >>= f)
That is, the return operation just lifts a variable name to the term that consists of just that variable, and the bind operation just maps its argument function over the variable names in the term, leaving everything else alone.
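Current versions of GHC require Functor and Applicative instances before they will accept a Monad instance, so here is a hedged sketch of the same monad in a form that should compile today. The Functor and Applicative instances and the example substitution subst are additions made for this sketch; only the bind clauses follow the instance shown above.

```haskell
data Term v = TVar v
            | TInt
            | TString
            | Fun (Term v) (Term v)
  deriving (Eq, Show)

-- Renaming variables is a special case of substitution.
instance Functor Term where
  fmap f t = t >>= (return . f)

instance Applicative Term where
  pure = TVar
  tf <*> tx = tf >>= \f -> fmap f tx

-- Bind substitutes a term for each variable and leaves the rest alone.
instance Monad Term where
  TVar v  >>= f = f v
  TInt    >>= _ = TInt
  TString >>= _ = TString
  Fun d r >>= f = Fun (d >>= f) (r >>= f)

-- A sample substitution (hypothetical): a becomes String, b becomes Int,
-- and every other variable is left as itself.
subst :: String -> Term String
subst "a" = TString
subst "b" = TInt
subst x   = TVar x
```

With these definitions, Fun (TVar "a") TInt >>= subst evaluates to Fun TString TInt, and TVar "c" >>= subst is just TVar "c".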

Jones wants to write a function, unify, which performs a unification algorithm over these terms. Unification answers the question of whether, given two terms, there is a third term that is an instance of both. For example, consider the two terms a → Int and String → b, which are represented by Fun (TVar "a") TInt and Fun TString (TVar "b"), respectively. These terms can be unified, since the term String → Int is an instance of both; one can assign a = TString and b = TInt to turn both terms into Fun TString TInt.

The result of the unification algorithm should be a set of these bindings, in this example saying that the input terms can be unified by replacing the variable "a" with the term TString, and the variable "b" with the term TInt. This set of bindings can be represented by a function that takes a variable name and returns the term to which it should be bound. The function will have type v → Term v. For the example above, the result is a function which takes "a" and returns TString, and which takes "b" and returns TInt. What should this function do with variable names other than "a" and "b"? It should say that the variable named "c" is "replaced" by the term TVar "c", and similarly other variables. Given any other variable name x, it should say that the variable x is "replaced" by the term TVar x.

The unify function will take two terms and return one of these substitutions, where the substitution is a function of type v → Term v. So the unify function has type:

    unify :: Term v → Term v → (v → Term v)
Oh, but not quite. Because unification can also fail. For example, if you try to unify the terms a → b and Int, represented by Fun (TVar "a") (TVar "b") and TInt respectively, the unification should fail, because there is no term that is an instance of both of those; one represents a function and the other represents an integer. So unify does not actually return a substitution of type v → Term v. Rather, it returns a monad value, which might contain a substitution, if the unification is successful, and otherwise contains an error value. To handle the example above, the unify function will contain a case like this:

        unify TInt (Fun _ _) = fail ("Cannot unify" ....)
It will fail because it is not possible to unify functions and integers.

If unification is successful, then instead of using fail, the unify function will construct a substitution and then return it with return. Let's consider the result of unifying TInt with TInt. This unification succeeds, and produces a trivial substitution with no bindings. Or more precisely, every variable x should be "replaced" by the term TVar x. So in this case the substitution returned by unify should be the trivial one, a function which takes x and returns TVar x for all variable names x.

But we already have such a function. This is just what we decided that Term's return function should do, when we were making Term into a monad. So in this case the code for unify is:

        unify TInt TInt = return return
Yep, in this case the unify function returns the return function.

Wheee!

At this point in the paper I was skimming, but when I saw return return, I boggled. I went back and read it more carefully after that, you betcha.
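To see that return return really typechecks, here is a minimal sketch that covers only a few base cases. It is not Jones's unify: the name unifyBase, the Subst synonym, the choice of Either String as the outer monad, and the use of Left in place of fail are all assumptions made for this illustration.

```haskell
data Term v = TVar v
            | TInt
            | TString
            | Fun (Term v) (Term v)
  deriving (Eq, Show)

instance Functor Term where
  fmap f t = t >>= (return . f)

instance Applicative Term where
  pure = TVar
  tf <*> tx = tf >>= \f -> fmap f tx

instance Monad Term where
  TVar v  >>= f = f v
  TInt    >>= _ = TInt
  TString >>= _ = TString
  Fun d r >>= f = Fun (d >>= f) (r >>= f)

type Subst v = v -> Term v

-- Outer return: success in the Either String monad.
-- Inner return: Term's return, that is, the trivial substitution.
unifyBase :: Term v -> Term v -> Either String (Subst v)
unifyBase TInt    TInt      = return return
unifyBase TString TString   = return return
unifyBase TInt    (Fun _ _) = Left "Cannot unify Int with a function type"
unifyBase _       _         = Left "case not covered in this sketch"
```

The two occurrences of return live in different monads, and the type signature is what tells the compiler which is which.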

That's my couple of paragraphs. I was planning to get to this point and then say "But that's not what I was planning to discuss. What I really wanted to talk about was...". But I think I'll break with my usual practice and leave the other thing for tomorrow.

Happy Diada Nacional de Catalunya, everyone!

[ Addendum 20080917: Here's the article about the other thing. ]


[Other articles in category /prog] permanent link

Tue, 09 Sep 2008

Factorials are not quite as square as I thought
(This is a followup to yesterday's article.)

Let s(n) be the smallest perfect square larger than n. Then to have n! = a² - 1 we must have a² = s(n!), and in particular we must have s(n!) - n! square.

This actually occurs for n in { 4, 5, 6, 7, 8, 9, 10, 11 }, and since 11 was as far as I got on the lunch line yesterday, I had an exaggerated notion of how common it is. Had I worked out another example, I would have realized that after n=11 things start going wrong. The value of s(12!) is 21887², but 21887² - 12! = 39169, and 39169 is not a square. (In fact, the n=11 solution is quite remarkable, as I will discuss at the end of this note.)

So while there are (of course) solutions to 12! = a² - b², and indeed where b is small compared to a, as I said, the smallest such b takes a big jump between 11 and 12. For 4 ≤ n ≤ 11, the minimal b takes the values 1, 1, 3, 1, 9, 27, 15, 18. But for n = 12, the solution with the smallest b has b = 288.
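The minimal values of b are easy to recompute by brute force. The following is a rough sketch, not the Mathematica calculation mentioned below; the names isqrt, isSquare, and minimalB are inventions for this example.

```haskell
-- Integer square root (rounded down) by Newton's method.
isqrt :: Integer -> Integer
isqrt 0 = 0
isqrt n = go n
  where
    go x
      | y >= x    = x
      | otherwise = go y
      where
        y = (x + n `div` x) `div` 2

isSquare :: Integer -> Bool
isSquare n = r * r == n
  where r = isqrt n

factorial :: Integer -> Integer
factorial n = product [1 .. n]

-- Smallest b >= 1 such that n! + b^2 is a perfect square.
-- Only call this for n >= 4: then n! is divisible by 4, a solution
-- is guaranteed to exist, and the search terminates.
minimalB :: Integer -> Integer
minimalB n = head [b | b <- [1 ..], isSquare (nf + b * b)]
  where nf = factorial n
```

Evaluating map minimalB [4 .. 12] reproduces the values quoted above: 1, 1, 3, 1, 9, 27, 15, 18, and then 288.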

Calculations with Mathematica by Mitch Harris show that one has n! = s(n!) - b² only for n in {1, 4, 5, 6, 7, 8, 9, 10, 11, 13, 14, 15, 16}, and then not for any other n under 1,000. The likelihood that I imagine of another solution for n! = a² - 1, which was already not very high, has just dropped precipitously.

My thanks to M. Harris, and also to Stephen Dranger, who also wrote in with the results of calculations.

Having gotten this far, I then asked OEIS about the sequence 1, 1, 3, 1, 9, 27, 15, 18, and (of course) was delivered a summary of the current state of the art in n! = a² - 1. Here's my summary of the summary.

The question is known as "Brocard's problem", and was posed by Brocard in 1876. No solutions are known with n > 7, and it is known that if there is a solution, it must have n > 10⁹. According to the Mathworld article on Brocard's problem, it is believed to be "virtually certain" that there are no other solutions.

The calculations for n ≤ 10⁹ are described in this unpublished paper of Berndt and Galway, which I found linked from the Mathworld article. The authors also investigated solutions of n! = a² - b² for various fixed b between 2 and 50, and found no solutions with 12 ≤ n ≤ 10⁵ for any of them. The most interesting was the 11! = 6318² - 18² I mentioned already.

[ The original version of this article contained some confusion about whether s(n) was the largest square less than n, or the largest number whose square was less than n. Thanks to Roie Marianer for pointing out the error. ]


[Other articles in category /math] permanent link

Mon, 08 Sep 2008

Factorials are almost, but not quite, square
This weekend I happened to notice that 7! = 71² - 1. Is this a strange coincidence? Well, not exactly, because it's not hard to see that

$$n! = a^{2} - b^{2}\qquad (*)$$

will always have solutions where b is small compared to a. For example, we have 11! = 6318² - 18².

But to get b=1 might require a lot of luck, perhaps more luck than there is. (Jeremy Kahn once argued that |2^x - 3^y| = 1 could have no solutions other than the obvious ones, essentially because it would require much more fabulous luck than was available. I sneered at this argument at the time, but I have to admit that there is something to it.)

Anyway, back to the subject at hand. Is there an example of n! = a² - 1 with n > 7? I haven't checked yet.

In related matters, it's rather easy to show that there are no nontrivial examples with b=0.

It would be pretty cool to show that equation (*) implied n = O(f(b)) for some function f, but I would not be surprised to find out that there is no such bound.

This kept me amused for twenty minutes while I was in line for lunch, anyway. Incidentally, on the lunch line I needed to estimate √11. I described in an earlier article how to do this. Once again it was a good trick, the sort you should keep handy if you are the kind of person who needs to know √11 while standing in line on 33rd Street. Here's the short summary: √11 = √(99/9) = √((100-1)/9) = √((100/9)(1 - 1/100)) = (10/3)√(1 - 1/100) ≈ (10/3)(1 - 1/200) = (10/3)(199/200) = 199/60.

[ Addendum 20080909: There is a followup article. ]


[Other articles in category /math] permanent link

Sat, 12 Jul 2008

runN revisited
Exactly one year ago I discussed runN, a utility that I invented for running the same command many times, perhaps in parallel. The program continues to be useful to me, and now Aaron Crane has reworked it and significantly improved the interface. I found his discussion enlightening. He put his finger on a lot of problems that had been bothering me that I had not quite been able to pin down.

Check it out. Thank you, M. Crane.


[Other articles in category /prog] permanent link

Fri, 11 Jul 2008

Period three and chaos
In the copious spare time I have around my other major project, I am tinkering with various stuff related to Möbius functions. Like all the best tinkering projects, the Möbius functions are connected to other things, and when you follow the connections you can end up in many faraway places.

A Möbius function is simply a function of the form f : x → (ax + b) / (cx + d) for some constants a, b, c, and d. Möbius functions are of major importance in complex analysis, where they correspond to certain transformations of the Riemann sphere, but I'm mostly looking at the behavior of Möbius functions on the reals, and so restricting a, b, c, and d to be real.

One nice thing about the Möbius functions is that you can identify the Möbius function f : x → (ax + b) / (cx + d) with the matrix ${ a\, b \choose c\,d}$, because then composition of Möbius functions is the same as multiplication of the corresponding matrices, and so the inverse of a Möbius function with matrix M is just the function that corresponds to M⁻¹. Determining whether a set of Möbius functions is closed under composition is the same as determining whether the corresponding matrices form a semigroup; you can figure out what happens when you iterate a Möbius function by looking at the eigenvalues of M, and so on.

The matrices are not quite identical with the Möbius functions, because the matrix ${ 1\, 0 \choose 0\,1}$ and the matrix ${ 2\, 0 \choose 0\,2}$ are the same Möbius function. So you really need to consider the set of matrices modulo the equivalence relation that makes two matrices equivalent if they are the same up to a scalar factor. If you do this you get a group of matrices called the "projective linear group", PGL(2). This takes us off into classical group theory and Lie groups, which I have been intermittently trying to figure out.
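Here is a rough sketch of the matrix representation, with composition as matrix multiplication and equality taken up to a scalar factor. The type Mobius and the functions compose, apply, and sameFunction are names invented for this sketch, and apply makes no attempt to handle ∞: it simply fails if the denominator is zero.

```haskell
-- The function x -> (a x + b) / (c x + d), stored as its matrix entries a b c d.
data Mobius = Mobius Integer Integer Integer Integer
  deriving Show

-- Matrix product; compose f g corresponds to the function f . g
compose :: Mobius -> Mobius -> Mobius
compose (Mobius a b c d) (Mobius a' b' c' d') =
  Mobius (a * a' + b * c') (a * b' + b * d')
         (c * a' + d * c') (c * b' + d * d')

-- Evaluate at a finite rational point that is not a pole.
apply :: Mobius -> Rational -> Rational
apply (Mobius a b c d) x =
  (fromInteger a * x + fromInteger b) / (fromInteger c * x + fromInteger d)

-- Two nonzero matrices give the same function when one is a scalar
-- multiple of the other, that is, when all cross products agree.
sameFunction :: Mobius -> Mobius -> Bool
sameFunction m n =
  and [ui * vj == uj * vi | (ui, vi) <- zip u v, (uj, vj) <- zip u v]
  where
    u = entries m
    v = entries n
    entries (Mobius a b c d) = [a, b, c, d]
```

For instance, sameFunction (Mobius 1 0 0 1) (Mobius 2 0 0 2) is True, and apply (compose f g) x agrees with apply f (apply g x) wherever both sides are defined.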

You can also consider various subgroups of PGL(2), such as the subgroup that leaves the set {0, 1, ∞, -1} fixed. The reciprocal function x → 1/x is one such; it leaves 1 and -1 fixed and exchanges 0 and ∞.

In general a Möbius function has three degrees of freedom, since you can choose the four constants a, b, c, and d however you like, but one degree of freedom is removed because of the equivalence relation; or, to look at it another way, you get to pick b/a, c/a, and d/a however you like. So in general you can pick any p, q, and r and find the unique Möbius function m with m(0) = p, m(1) = q, m(-1) = r. These then determine m(∞), which turns out to be (2qr - p(q+r))/(q + r - 2p) when that is defined. And sometimes even when it isn't.

You may be worrying about the infinities here, but it's really nothing much to worry about. f(∞) is nothing more than $\lim_{x\rightarrow\infty} f(x)$.

If (2qr - p(q+r))/(q + r - 2p) in the presence of infinities worries you, try a few examples. For instance, consider m : x → x+1. This function has p = m(0) = 1, q = m(1) = 2, r = m(-1) = 0. Plugging into the formula, we get m(∞) = -pq/(q - 2p) = -2 / (2-2) = -2/0 = ∞, which is just right.

The only other thing you have to remember is that +∞ = -∞, because we're really living on the Riemann sphere. Or rather, we're living on the real part of the Riemann sphere, but either way there's only one ∞. We might call this space the "Riemann circle", but I've never heard it called that. And neither has Google, although it did turn up a bulletin board post in which someone else asked the same question in a similar context. There's a picture of it farther down on the right.

Anyway, most choices of p, q, and r in {0, 1, ∞, -1} do not get you permutations of {0, 1, ∞, -1}, because they end up mapping ∞ outside that set. For example, if you take p = 1, q = -1, r = 0, you get m(x) = (x + 1)/(-3x + 1) and so m(∞) = -1/3. But obviously the identity function has the desired property, and if you think about the Riemann circle (excuse me, Riemann sphere) you immediately get the rest: any rigid motion of the Riemann sphere is a Möbius function, and some of those motions permute the four points {0, 1, ∞, -1}. In fact, there are eight such functions, because {0, 1, ∞, -1} are at the vertices of a square, so any rigid motion of the Riemann sphere that permutes {0, 1, ∞, -1} must be a rigid motion of that square, and the square has eight symmetries, namely the elements of the group D4:

D4 element                m(0)  m(1)  m(∞)  m(-1)  m(x) = ?              M
Identity                    0     1     ∞    -1    x                     ${ 1\, 0 \choose 0\, 1}$
Rotate clockwise            1     ∞    -1     0    (x + 1) / (-x + 1)    ${ 1\, 1 \choose -1\, 1}$
Rotate 180°                 ∞    -1     0     1    -1/x                  ${ 0\, -1 \choose 1\, 0}$
Rotate counterclockwise    -1     0     1     ∞    (x - 1) / (x + 1)     ${ 1\, -1 \choose 1\, 1}$
Reflect horizontally        0    -1     ∞     1    -x                    ${ -1\, 0 \choose 0\, 1}$
Reflect vertically          ∞     1     0    -1    1/x                   ${ 0\, 1 \choose 1\, 0}$
Reflect diagonally (1)      1     0    -1     ∞    (-x + 1) / (x + 1)    ${ -1\, 1 \choose 1\, 1}$
Reflect diagonally (2)     -1     ∞     1     0    (x + 1) / (x - 1)     ${ 1\, 1 \choose 1\, -1}$

Here we have eight functions on the reals which make the group D4 under the operation of composition. For example, if f(x) = (x+1)/(-x+1), which is the clockwise rotation with matrix ${ 1\, 1 \choose -1\, 1}$, then f(f(f(f(x)))) = x. Isn't that nice?

Anyway, none of that was what I was really planning to talk about. (You knew that was coming, didn't you?)

What I wanted to discuss was the function f : x → 1 / (1 - x). I found this function because I was considering other permutations of {0, 1, ∞, -1}. The f function takes 0 → 1 → ∞ → 0. (It also takes -1 → 1/2, and so is not one of the functions in the D4 table above.) We say that f has a periodic point of order 3 because f(f(f(x))) = x for some x; in this case at least for x ∈ {0, 1, ∞}.

A function with a periodic point of order three is not something you see every day, and I was somewhat surprised that as simple a function as 1/(1-x) had one. But if you do the algebra and calculate f(f(f(x))) explicitly, you find that you do indeed get x, so every point is a periodic point of order 3, or possibly 1.

Or you can do a simpler calculation: since f is the Möbius function that corresponds to the matrix F = ${ \hphantom{-}0\, 1 \choose -1\,1}$, just calculate F³. You get ${ -1\, \hphantom{-}0 \choose \hphantom{-}0\, -1}$, which is indeed the identity function, because a matrix and any nonzero scalar multiple of it correspond to the same Möbius function.

This also gives you a simple matrix M for which M⁷ = M, if you happened to be looking for such a thing.
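The period-3 behavior can also be spot-checked with exact rational arithmetic. This is only a sketch: the helper applyN is made up for the example, and the starting points must avoid 0 and 1, since the code has no representation for ∞.

```haskell
-- The function under discussion, on exact rationals.
-- Dividing by zero at x = 1 raises an exception, so keep away from 0 and 1.
f :: Rational -> Rational
f x = 1 / (1 - x)

-- Apply a function n times.
applyN :: Int -> (a -> a) -> a -> a
applyN n g x = iterate g x !! n
```

For example, starting from 2 the orbit is 2, -1, 1/2, and then 2 again, so applyN 3 f 2 is 2. Likewise applyN 7 f x agrees with f x, which is the function-level counterpart of M⁷ = M.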

I had noticed a couple of years ago that this 1/(1-x) function had period 3, and then forgot about it. Then I noticed it again a few weeks ago, and a nagging question came into my mind, which is reflected in a note I wrote in my notebook at that point: "WHAT ABOUT SARKOVSKY'S THEOREM?"

Well, what about it? Sharkovskii's theorem (I misspelled it in the notebook) is a delightful generalization of the "Period three implies chaos" theorem of Li and Yorke. It says, among other things, that if a continuous function of the reals has a periodic point of order 3, then it also has a periodic point of order n for all positive integers n. In particular, we can take n=1, so the function f, which has a periodic point of order 3, must also have a fixed point. But it's quite easy to see that f has no fixed point on the reals: Just put f(x) = 1/(1-x) = x and solve for x; this gives x² - x + 1 = 0, which has no real solutions.

So what about Sharkovskii's theorem? Oh, it only applies to continuous functions, and f is not, because f(1) = ∞. So that's all right.

The Sharkovskii thing is excellent. The Sharkovskii ordering of the integers is:

3 < 5 < 7 < 9 < ...
  < 6 < 10 < 14 < 18 < ...
  < 12 < 20 < 28 < 36 < ...
...
... < 16 < 8 < 4 < 2 < 1.

And the theorem says that if a continuous function of the reals has a periodic point of order n, then it also has a periodic point of order m for all m > n in the Sharkovskii ordering. So if the function has a periodic point of order 2, it must also have a fixed point; if it has a periodic point of order 4, it must also have a periodic point of order 2; if it has a periodic point of order 17, it must also have periodic points of all even orders and all odd orders greater than 17, and so on.
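The ordering is easy to mechanize once each integer is written as a power of two times an odd number. The following comparison function is a sketch under that reading of the ordering above; the names split2 and sharkBefore are inventions for this example, and both functions assume positive arguments.

```haskell
-- Write n as 2^k * m with m odd, and return (k, m).
-- Assumes n > 0; it would loop forever on 0.
split2 :: Integer -> (Integer, Integer)
split2 n = go 0 n
  where
    go k m
      | even m    = go (k + 1) (m `div` 2)
      | otherwise = (k, m)

-- sharkBefore a b is True when a comes strictly before b in the
-- Sharkovskii ordering, so that period a forces period b.
sharkBefore :: Integer -> Integer -> Bool
sharkBefore a b = key a < key b
  where
    key n = case split2 n of
      (k, 1) -> (1 :: Int, negate k, 0)  -- pure powers of two come last, biggest first
      (k, m) -> (0, k, m)                -- otherwise sort by power of two, then odd part
```

So sharkBefore 17 2 and sharkBefore 17 19 are both True, while sharkBefore 17 15 is False, in line with the order-17 example above.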

The 1/(1-x) function led me to read more about Sharkovskii's theorem and its predecessor, the "period three implies chaos" theorem. Isn't that a great name for a theorem? And Li and Yorke knew it, because that's what they titled their paper. "Chaos" in this context means the following: say that two values a and b are "scrambled" by f if, for any given d and ε, there is some n for which |f^n(a) - f^n(b)| > d, and some m for which |f^m(a) - f^m(b)| < ε. That is, a and b are scrambled if repeated application of f drives a and b far apart, then close together, then far apart again, and so on. Then, if f is a continuous function with a periodic point of order 3, there is some uncountable set S of reals such that f scrambles all distinct pairs of values a and b from S. All that was from memory; I hope I got it more or less correct.

(The Li and Yorke paper also includes an example of a continuous function with a periodic point of order 5 but no periodic point of order 3. It's pretty simple.)

Reading about Sharkovskii's theorem and related matters led me to the web pages of James A. Yorke (of Li and Yorke), and then to the book Chaos: An Introduction to Dynamical Systems that he did with Alligood and Sauer, which is very readable.

I was pleased to finally be studying this material, because it was a very early inspiration to me. When I was about fourteen, my cousin Alex, who is an analytic chemist, came to visit, and told me about period-doubling and chaos in the logistic map. (It was all over the news at the time.) The logistic map is just f : x → λx(1-x) for some constant λ. For small λ, the map has a single fixed point, which increases as λ does. But at a certain critical value of λ (λ=3, actually) the function's behavior changes, and it suddenly begins to have a periodic point of order 2. As λ increases further, the behavior changes again, and the periodicity changes from order 2 to order 4. As λ increases, this happens again and again, with the splits occurring at exponentially closer and closer values of λ. Eventually there is a magic value of λ at which the function goes berserk and is chaotic. Chaos continues for a while, and then the function develops a periodic point of order 3, which bifurcates...

(The illustration here, which I copied from Wikipedia, uses r instead of λ.)
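The first period doubling is easy to watch numerically. Here is a minimal sketch using ordinary floating-point numbers; the function names, the starting point 0.3, and the burn-in length of 1000 steps are arbitrary choices made for this example.

```haskell
-- One step of the logistic map for a given lambda.
logistic :: Double -> Double -> Double
logistic lambda x = lambda * x * (1 - x)

-- Iterate from x0, throw away the first `burn` values so the orbit
-- can settle, and return the next `keep` values.
orbitTail :: Double -> Int -> Int -> Double -> [Double]
orbitTail lambda burn keep x0 =
  take keep (drop burn (iterate (logistic lambda) x0))
```

With λ = 2.5, orbitTail 2.5 1000 4 0.3 settles on the fixed point 1 - 1/λ = 0.6. With λ = 3.2, the same call alternates between two distinct values, which is the periodic point of order 2.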

I was deeply impressed. For some reason I got the idea that I would need to understand partial differential equations to understand the chaos and the logistic map, so I immediately set out on a program to learn what I thought I would need to know. I enrolled in differential equations courses at Columbia University instead of in something more interesting. The partial differential equations turned out to be a sidetrack, but in those days there were no undergraduate courses in iterated dynamic systems.

I am happy to discover that after only twenty-five years I am finally arriving at the destination.

Cousin Alex also told me to carry a notebook and pen with me wherever I went. That was good advice, and it took me rather less time to learn.


[Other articles in category /math] permanent link

Sun, 29 Jun 2008

Freshman electromagnetism questions: answer 3
Last year I asked a bunch of basic questions about electromagnetism. Many readers wrote in with answers and explanations, which I still hope to write up in detail. In the meantime, however, I figured out the answer to one of the questions by myself.

I had asked:

  1. Any beam of light has a time-varying electric field, perpendicular to the direction that the light is travelling. If I shine a light on an electron, why doesn't the electron vibrate up and down in the varying electric field? Or does it?

And one day a couple of months ago it occurred to me that yes, of course the electron vibrates up and down, because that is how radio antennas work. The EM wave comes travelling by, and the electrons bound in the metal antenna vibrate up and down. When electrons vibrate up and down in a metal wire, it is called an alternating current. Some gizmo at the bottom end of the antenna detects the alternating current and turns it back into the voice of Don Imus.

I thought about it a little more, and I realized that this vibration effect is also how microwave ovens work. The electromagnetic microwave comes travelling by, and it makes the electrons in the burrito vibrate up and down. But these electrons are bound into water molecules, and cannot vibrate freely. Instead, the vibrational energy is dissipated as heat, so the burrito gets warm.

So that's one question out of the way. Probably I have at least three reader responses telling me this exact same thing. And perhaps someday we will all find out together...


[Other articles in category /physics] permanent link

Sun, 01 Jun 2008

Addenda to recent articles 200805

  • Regarding the bicameral mind theory put forth in Julian Jaynes' book The Origin of Consciousness in the Breakdown of the Bicameral Mind, Carl Witty informs me that the story "Sour Note on Palayata", by James Schmitz, features a race of bicameral aliens whose mentality is astonishingly similar to the bicameral mentality postulated by Julian Jaynes. M. Witty describes it as follows:

    The story features a race of humanoid aliens with a "public" and a "private" mind. The "public" mind is fairly stupid, and handles all interactions with the real world; and the "private" mind is intelligent and psychic. The private mind communicates psychically with the private minds of other members of the race, but has only limited influence over the public mind; this influence manifests as visions and messages from God.
    This would not be so remarkable, since Jaynes' theories have been widely taken up by some science fiction authors. For example, they appear in Neal Stephenson's novel Snow Crash, and even more prominently in his earlier novel The Big U, so much so that I wondered when reading it how anyone could understand it without having read Jaynes first. But Schmitz's story was published in 1956, twenty years before the publication of The Origin of Consciousness.

  • Also in connection with Jaynes: I characterized his theory as "either a work of profound genius, or of profound crackpottery". I should have mentioned that this characterization was not lost on Jaynes himself. In his book, he referred to his own theory as "preposterous".

  • Many people wrote in with more commentary about my articles on artificial Finnish [1] [2]:
    • I had said that "[The one-letter word 'i'] appears in my sample in connection with Sukselaisen I hallitus, whatever that is". Several people explained that this "I" is actually a Roman numeral 1, denoting the ordinal number "first", and that Sukselaisen I hallitus is the first government headed by V. J. Sukselainen.

      I had almost guessed this—I saw "Sukselaisen I" in the source material and guessed that the "I" was an ordinal, and supposed that "Sukselaisen I" was analogous to "Henry VIII" in English. But when my attempts to look up the putative King Sukselaisen I met with failure, and I discovered that "Sukselaisen I" never appeared without the trailing "hallitus", I decided that there must be more going on than I had supposed, as indeed there was. Thanks to everyone who explained this.

    • Marko Heiskanen says that the (fictitious) word yhdysvalmistämistammonit is "almost correct", at least up to the nonsensical plural component "tammonit". The vowel harmony failure can be explained away because compound words in Finnish do not respect the vowel harmony rules anyway.

    • Several people objected to my program's generation of the word "klee": Jussi Heinonen said "Finnish has quite few words that begin with two consonants", and Jarkko Hietaniemi said "No word-initial "kl":s possible in native Finnish words". I checked, and my sample Finnish input contains "klassisesta", which Jarkko explained was a loanword, I suppose from Russian.

      Had I used a larger input sample, oddities like "klassisesta" would have had less influence on the output.

    • I acquired my input sample by selecting random articles from Finnish Wikipedia, but my random sampling was rather unlucky, since it included articles about Mikhail Baryshnikov (not Finnish), Dmitry Medvedev (not Finnish), and Los Angeles (also not Finnish). As a result, the input contained too many strange un-Finnish letters, like B, D, š, and G, and so therefore did the output. I could have been more careful in selecting the input data, but I didn't want to take the time.

      Medvedev was also the cause of that contentious "klassisesta", since, according to Wikipedia, "Medvedev pitää klassisesta rock-musiikista". The Medvedev presidency is not even a month old and already he has this international incident to answer for. What catastrophes could be in the future?

    • Another serious problem with my artificial Finnish is that the words were too long; several people complained about this, and the graph below shows the problem fairly clearly:

      The x-axis is word length, and the y-axis is frequency, on a logarithmic scale, so that if 1/100 of the words have 17 letters, the graph will include the point (17, -2). The red line, "in.dat", traces the frequencies for my 6 kilobyte input sample, and the blue line, "pseudo.dat", the data for the 1000-character sample I published in the article. ("Ävivät mena osakeyhti...") The green line, "out.dat", is a similar trace for a 6 kb N=3 text I generated later. The long right tail is clearly visible. My sincere apologies to color-blind (and blind) readers.

      I am not sure exactly what happened here, but I can guess. The Markov process has a limited memory, 3 characters in this case, so in particular it has essentially no idea how long the words are that it is generating. This means that the word lengths that it generates should appear in roughly an exponential distribution, with the probability of a word of length N approximately equal to $\lambda e^{-\lambda N}$, where 1/λ is the mean word length.

      But there is no particular reason why word lengths in Finnish (or any other language) should be exponentially distributed. Indeed, one would expect that the actual distribution would differ from exponential in several ways. For example, extremely short words are relatively uncommon compared with what the exponential distribution predicts. (In the King James Bible, the most common word length is 3, then 4, with 1 and 8 tied for a distant seventh place.) This will tend to push the mean rightwards, and so it will skew the Markov process' exponential distribution rightwards as well.

      I can investigate the degree to which both real text and Markov process output approximate a theoretical exponential distribution, but not today. Perhaps later this month.
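
      To spell out the exponential claim made above: suppose, as a simplifying assumption that the text does not state, that after each letter the process ends the word with some fixed probability p, regardless of how long the word already is. Then the word length N follows a geometric distribution, the discrete counterpart of the exponential:

      ```latex
      P(\text{length} = N) = (1-p)^{N-1}\,p
        = \frac{p}{1-p}\,e^{-\lambda N},
      \qquad \lambda = -\ln(1-p) \approx p \quad (p \ll 1).
      ```

      The mean of this distribution is 1/p, and for small p both p/(1-p) and λ are close to p, which gives the approximate form $\lambda e^{-\lambda N}$ with mean 1/λ.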

    My thanks again to the many helpful Finnish speakers who wrote in on these and other matters, including Marko Heiskanen, Shae Erisson, Antti-Juhani Kaijanaho, Ari Loytynoja, Ilmari Vacklin, Jarkko Hietaniemi, Jussi Heinonen, Nuutti-Iivari Meriläinen, and any others I forgot to mention.

  • My explanation of Korean vowel harmony rules in that article is substantively correct, but my description of the three vowel groups was badly wrong. I have apparently forgotten most of the tiny bit I once knew about Middle Korean. For a correct description, see the Wikipedia article or this blog post. My thanks to the anonymous author of the blog post for his correction.

  • Regarding the transitivity of related-by-blood-ness, Toth András told me about a (true!) story from the life of Hungarian writer Karinthy Frigyes:

    Karinthy Frigyes got married two times, the Spanish flu epidemic took his first wife away. A son of his was born from his first marriage, then his second wife brought a boy from his previous husband, and a common child was born to them. The memory of this the reputed remark: "Aranka, your child and my child beats our child."

    (The original Hungarian appears on this page, and the surprisingly intelligible translation was provided by M. Toth and the online translation service at webforditas.hu. Thank you, M. Toth.)

  • Chung-chieh Shan tells me that the missing document-viewer feature that I described is available in recent versions of xdvi. Tanaeem M. Moosa says that it is also available in Adobe Reader 8.1.2.


[Other articles in category /addenda] permanent link

Sat, 31 May 2008

Defunctionalization and Java
A couple of weeks ago I was introduced to the notion of defunctionalization by this article on Ken Knowles' blog. Defunctionalization is a program transformation that removes the higher-order functions from a program. The idea is that you replace something like λx.x+y with a data structure that encapsulates a value of y somewhere, say (HOLD y). And instead of using the language's built-in function application to apply this object directly to an argument x, you write a synthetic applicator that takes (HOLD y) and x and returns x + y. And anyone who wanted to apply λx.x+y to some argument x in some context in which y was bound should first construct (HOLD y), then use the synthetic applicator on (HOLD y) and x.

Consider, for example, the following Haskell program:

        -- Haskell
        aux f = f 1 + f 10
        res x = aux (\z -> z + x)
The defunctionalization of this example is:

        -- Haskell
        data Hold = HOLD Int
        fake_apply (HOLD a) b = a + b
        aux held = fake_apply held 1 + fake_apply held 10
        res x = aux (HOLD x)
I hope this will make the idea clear.

M. Knowles cites the paper Defunctionalization at work by Olivier Danvy and Lasse R. Nielsen, which was lots of fun. (My Haskell example above is a simplification of the example from page 5 of Danvy and Nielsen.) Among other things, Danvy and Nielsen point out that this defunctionalization transformation is in a certain sense dual to the transformation that turns ordinary data structures into λ-terms in Church encoding. Church encoding turns data items like pairs or booleans into higher-order functions; defunctionalization turns them back again.

Section 1.4 of the Danvy and Nielsen paper lists a whole bunch of contexts in which this technique has been studied and used, but one thing I didn't think I saw there is that this is essentially the transformation that Java programmers use when they want to use closures.

For example, suppose a Java programmer wants to write something like aux in:

        -- Haskell
        aux f = f 1 + f 10
        res x = aux (\z -> z + x)
But they can't, because Java doesn't have closures.

So instead, they do this:

        /* Java */

        class Hold {
          private int a;

          public Hold(int a) {
            this.a = a;
          }

          public int fake_apply(int b) {
            return this.a + b;
          }
        }

        class Example {   // wrapper class, named arbitrarily, so the methods compile
          private static int aux(Hold h) {
            return h.fake_apply(1) + h.fake_apply(10);
          }

          static int res(int x) {
            Hold h = new Hold(x);
            return aux(h);
          }
        }
Where the class Hold corresponds directly to the data type Hold in the defunctionalized Haskell code.
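
When a program contains more than one kind of closure, the defunctionalized data type gets one constructor per closure, and the synthetic applicator dispatches on the constructor. Here is a sketch of how that might look in Java, where method dispatch does the work of the case analysis. The names Fn, AddTo, MulBy, and Defunc are made up for the illustration and do not come from Danvy and Nielsen.

```java
// Hypothetical extension of the Hold example to two different closures.
interface Fn {
    int fakeApply(int b);
}

// Stands for the closure \z -> z + a
class AddTo implements Fn {
    private final int a;
    AddTo(int a) { this.a = a; }
    public int fakeApply(int b) { return a + b; }
}

// Stands for the closure \z -> z * a
class MulBy implements Fn {
    private final int a;
    MulBy(int a) { this.a = a; }
    public int fakeApply(int b) { return a * b; }
}

class Defunc {
    // Corresponds to: aux f = f 1 + f 10
    static int aux(Fn f) {
        return f.fakeApply(1) + f.fakeApply(10);
    }

    public static void main(String[] args) {
        System.out.println(aux(new AddTo(5)));  // (1 + 5) + (10 + 5)
        System.out.println(aux(new MulBy(5)));  // (1 * 5) + (10 * 5)
    }
}
```

Each class holds the captured value of one closure's free variable, and the interface plays the part of the data type that collects the constructors.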

Here is a real example. Consider GNU Emacs. When I enter text-mode in Emacs, I want a bunch of subsystems to be notified. Emacs has a text-mode-hook variable, which is basically a list of functions, and when an Emacs buffer is put into text-mode, Emacs invokes the hooks. Any subsystem that wants to be notified puts its own hook function into that variable. If I wanted to accomplish something similar in Haskell or SML, I would similarly use a list of functions.

In Java, the corresponding facility is called java.util.Observable. Were one implementing Emacs in Java (perish the thought!) the mode object would inherit from Observable, and so would provide an addObserver method for adding a hook to a list somewhere. When the mode was switched to text-mode, the mode object would call notifyObservers, which would loop over the hook list, calling the hooks. So far this is just like Emacs Lisp.

But in Java the hooks are not functions, as they are in Emacs, because in Java functions are not first-class entities. Instead, the hooks are objects which conform to the Observer interface specification, and instead of invoking functions directly, the notifyObservers method calls the update method on each hook object.
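
A minimal sketch of that arrangement follows, using a hand-rolled hook list rather than java.util.Observable itself. The names Hook, Mode, RecordingHook, and HookDemo are invented for the illustration, and the "subsystems" are just strings.

```java
import java.util.ArrayList;
import java.util.List;

// Plays the role of the Observer interface: a hook is an object with one method.
interface Hook {
    void update(String modeName);
}

// Plays the role of the Observable mode object.
class Mode {
    private final List<Hook> hooks = new ArrayList<>();

    void addHook(Hook h) {
        hooks.add(h);
    }

    // Analogous to notifyObservers: loop over the hook list, calling each hook.
    void enter(String modeName) {
        for (Hook h : hooks) {
            h.update(modeName);
        }
    }
}

// In Emacs Lisp this would be a function closed over `subsystem` and `log`;
// here the captured variables become fields.
class RecordingHook implements Hook {
    private final String subsystem;
    private final List<String> log;

    RecordingHook(String subsystem, List<String> log) {
        this.subsystem = subsystem;
        this.log = log;
    }

    public void update(String modeName) {
        log.add(subsystem + " saw " + modeName);
    }
}

class HookDemo {
    static List<String> run() {
        List<String> log = new ArrayList<>();
        Mode mode = new Mode();
        mode.addHook(new RecordingHook("spell-checker", log));
        mode.addHook(new RecordingHook("auto-fill", log));
        mode.enter("text-mode");
        return log;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```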

Here's another example. I wrote a recursive descent parser in Java a while back. An ActionParser is just like a Parser, except that if its parse succeeds, it invokes a callback. If I were programming in SML or Haskell or Perl, an ActionParser would be nothing but a Parser with an associated closure, something like this:

        # Perl        
        package ActionParser;

        sub new {
          my ($class, $parser, $action) = @_;
          bless { Parser => $parser,
                  Action => $action } => $class;
        }

        # Just like the embedded parser, but invoke the action on success
        sub parse {
          my $self = shift;
          my $input = shift;
          my $result = $self->{Parser}->parse($input);
          if ($result->success) {
            $self->{Action}->($result);   # Invoke action
          }
          return $result;          
        }
Here the Action member is expected to be a closure, which is automatically invoked if the parse succeeds. To use this, I would write something like this:

        # Perl        
        my $missiles;        
        ...
        my $parser = ActionParser->new($otherParser, 
                                       sub { $missiles->launch() }
                                      );
        $parser->parse($input);
And then if the input parses correctly, the parser launches the missiles from the anonymous closure, which has captured the local $missiles object.

But in Java, you have no closures. Instead, you defunctionalize, and represent closures with objects:

        /* Java */
        abstract class Action {
          void invoke(ParseResults results) {}
        }

        class ActionParser extends Parser {
          Action action;
          Parser parser;

          ActionParser(Parser p, Action a) {
            action = a;
            parser = p;
          }

          ParseResults parse(Input input) {
            ParseResults res = this.parser.parse(input);
            if (res.isSuccess) {
              this.action.invoke(res);
            }
            return res;
          }
        }
To use this, one writes something like this:

        /* Java */

        class LaunchMissilesAction extends Action {
          Missiles m;

          LaunchMissilesAction(Missiles m) { this.m = m; }
          void invoke(ParseResults results) {
            m.launch();
          }
        }

        ...

        Action a = new LaunchMissilesAction(missiles);
        Parser p = new ActionParser(otherParser, a);
        p.parse(input);
The constructor argument missiles takes the place of a free variable in a closure. The closure itself has been replaced with an object from an ad hoc class, just as in Danvy and Nielsen's formulation, the closure is replaced with a synthetic data object that holds the values of the free variables. The invoke method plays the role of fake_apply.
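
Since the fragments above lean on types that are not shown, here is one way to fill them in so that the whole thing runs. The Parser, ParseResults, and Missiles stubs are guesses made up for this sketch and are not the classes from the real parser: the stub parser takes a plain String instead of an Input and merely checks that it is non-empty.

```java
// Stub types, invented for this sketch.
class ParseResults {
    final boolean isSuccess;
    ParseResults(boolean isSuccess) { this.isSuccess = isSuccess; }
}

class Parser {
    // Hypothetical base behaviour: succeed on any non-empty input.
    ParseResults parse(String input) {
        return new ParseResults(!input.isEmpty());
    }
}

class Missiles {
    int launches = 0;
    void launch() { launches++; }
}

abstract class Action {
    abstract void invoke(ParseResults results);
}

class ActionParser extends Parser {
    private final Parser parser;
    private final Action action;

    ActionParser(Parser p, Action a) {
        parser = p;
        action = a;
    }

    @Override
    ParseResults parse(String input) {
        ParseResults res = parser.parse(input);
        if (res.isSuccess) {
            action.invoke(res);   // the defunctionalized closure call
        }
        return res;
    }
}

class LaunchMissilesAction extends Action {
    private final Missiles m;     // takes the place of the closure's free variable

    LaunchMissilesAction(Missiles m) { this.m = m; }

    @Override
    void invoke(ParseResults results) {
        m.launch();
    }
}

class ActionDemo {
    public static void main(String[] args) {
        Missiles missiles = new Missiles();
        Parser p = new ActionParser(new Parser(), new LaunchMissilesAction(missiles));
        p.parse("fire");   // parse succeeds, so the action runs
        p.parse("");       // parse fails, so it does not
        System.out.println(missiles.launches);
    }
}
```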

Now, it's not a particularly interesting observation that this can be done. The interesting part, I think, is that this is what Java programmers actually do. And also, perhaps, that Danvy and Nielsen didn't mention it in their paper, because I think the technique is pretty widespread.


[Other articles in category /prog] permanent link