Csaba's coding experiences: 2012

Friday, November 23, 2012

Interrelated Generic Types

When designing software, after varying amounts of thought and work, I can usually arrive at a solution that seems elegant, reliable, I dare say beautiful. The structure of the class hierarchy is straightforward and closely mirrors the mental image of the problem, making it easy to understand and handle. There are some cases, however, where I get stuck: I expect that something could be done better, but I don't know how. Maybe I am not intelligent enough to find the 'proper' solution, or maybe the language simply lets me down. I have two concrete examples in mind, and in this entry I present the one involving generics.

The example comes from my GameCracker project, but you don't need to be familiar with it to understand the problem. I'll present everything in as much detail as I deem necessary to understand what I want to convey.

Designing a Game API

For modeling a game, I need a type that represents the complete state of the game at a moment in time. I call this type Position. In each position, there may be actions that can be performed; these are the Moves. The Position specifies the set of possible Moves, and performing one of these moves leads to a different Position. In code, it looks like this:

interface Move {}

interface Position {
    List<Move> getMoves();
    Position move(Move move);
}

This is the API, and there are implementations for different games. For example, ChessMove and ChessPosition, or TicTacToeMove and TicTacToePosition.

The types that the API defines are implicitly interrelated: the classes of a single implementation belong together. For example, ChessPosition.getMoves will always return objects of type ChessMove, and it makes no sense to try to apply a TicTacToeMove to a ChessPosition. These kinds of constraints can be enforced using generics:

interface Move {}

interface Position<M extends Move> {
    List<M> getMoves();
    Position<M> move(M move);
}

And the implementation:

class ChessMove implements Move {...}

class ChessPosition implements Position<ChessMove> {
    public List<ChessMove> getMoves() {...}
    public Position<ChessMove> move(ChessMove move) {...}
}
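To make the benefit concrete, here is a minimal sketch of client code written against this two-type API. The toy game (PilePosition, TakeMove) and the helper countPlayouts are hypothetical names invented for illustration; they are not part of GameCracker.

```java
import java.util.ArrayList;
import java.util.List;

interface Move {}

interface Position<M extends Move> {
    List<M> getMoves();
    Position<M> move(M move);
}

// Hypothetical toy game: players alternately take 1 or 2 stones from a pile.
final class TakeMove implements Move {
    final int stones;
    TakeMove(int stones) { this.stones = stones; }
}

final class PilePosition implements Position<TakeMove> {
    final int remaining;
    PilePosition(int remaining) { this.remaining = remaining; }

    public List<TakeMove> getMoves() {
        List<TakeMove> moves = new ArrayList<>();
        for (int s = 1; s <= 2 && s <= remaining; s++) {
            moves.add(new TakeMove(s));
        }
        return moves;
    }

    public Position<TakeMove> move(TakeMove move) {
        return new PilePosition(remaining - move.stones);
    }
}

final class GameClient {
    // Works for any game. The type variable M guarantees that only moves
    // obtained from a compatible position are passed back to move().
    static <M extends Move> int countPlayouts(Position<M> pos) {
        List<M> moves = pos.getMoves();
        if (moves.isEmpty()) {
            return 1; // no moves left: this line of play is finished
        }
        int total = 0;
        for (M m : moves) {
            total += countPlayouts(pos.move(m));
        }
        return total;
    }
}
```

Passing a move of some other game to PilePosition.move is rejected by the compiler, and the generic client needs no casts.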

More Types

That looks fine. It would be even prettier if the move function returned ChessPosition. Nothing keeps us from overriding the move function with this more concrete return type, but to enforce the constraint through the type system, we need a type variable for the implementation type of Position.

There are other reasons to have this type variable. In a game, there can be transformations between positions, for example when two positions are the same except that the board is mirrored. So there is a function that compares positions, and another one that applies a transformation to a position. Similarly to how only ChessMoves should be applied to ChessPositions, it makes no sense to try to compare positions of different games.

What the 'transformations' actually mean depends on the game itself, and they are perhaps not simple symmetries. So this is also a type that is associated with the game implementation. Putting it all together, we have:

interface Transformation {}

interface Move {}

interface Position<P extends Position<P,M,T>, M extends Move, T extends Transformation> {
    List<M> getMoves();
    P move(M move);
    T getTransformationTo(P pos);
    P transform(T transformation);
}

Now we have three different type variables, but it all makes sense. For a given game, P is the type of the game's positions, M is the type of the game's moves, and T is the type of the transformations that make sense for the game. Now, it turns out that I need some other functions: I need to be able to apply transformations to moves, and I need to compose transformations. All this with the type constraints that a game's move can only be transformed by the game's transformation, and only transformations of the same kind can be composed. With these additions we arrive at the current API used in GameCracker:

interface Transformation<T extends Transformation<T>> {
    T compose(T other);
}
interface Move<M extends Move<M,T>, T extends Transformation<T>> {
    M transform(T transformation);
}
interface Position<P extends Position<P,M,T>, M extends Move<M,T>, T extends Transformation<T>> {
    List<M> getMoves();
    P move(M move);
    T getTransformationTo(P pos);
    P transform(T transformation);
}
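As an illustration of how a concrete game plugs into these three interfaces, here is a minimal sketch. The toy game (a token on a strip of cells from -2 to 2, with a mirror symmetry) and the names StripPosition, StepMove and Mirror are assumptions made up for this example, not GameCracker code.

```java
import java.util.ArrayList;
import java.util.List;

interface Transformation<T extends Transformation<T>> {
    T compose(T other);
}

interface Move<M extends Move<M, T>, T extends Transformation<T>> {
    M transform(T transformation);
}

interface Position<P extends Position<P, M, T>, M extends Move<M, T>, T extends Transformation<T>> {
    List<M> getMoves();
    P move(M move);
    T getTransformationTo(P pos);
    P transform(T transformation);
}

// Hypothetical transformation: either the identity or a left-right mirror.
final class Mirror implements Transformation<Mirror> {
    final boolean flip;
    Mirror(boolean flip) { this.flip = flip; }
    public Mirror compose(Mirror other) { return new Mirror(flip ^ other.flip); }
}

// Hypothetical move: step one cell left (-1) or right (+1).
final class StepMove implements Move<StepMove, Mirror> {
    final int delta;
    StepMove(int delta) { this.delta = delta; }
    public StepMove transform(Mirror m) { return new StepMove(m.flip ? -delta : delta); }
}

// Hypothetical position: the cell the token stands on, between -2 and 2.
final class StripPosition implements Position<StripPosition, StepMove, Mirror> {
    final int cell;
    StripPosition(int cell) { this.cell = cell; }

    public List<StepMove> getMoves() {
        List<StepMove> moves = new ArrayList<>();
        if (cell > -2) moves.add(new StepMove(-1));
        if (cell < 2) moves.add(new StepMove(1));
        return moves;
    }

    public StripPosition move(StepMove move) { return new StripPosition(cell + move.delta); }

    public Mirror getTransformationTo(StripPosition pos) {
        if (pos.cell == cell) return new Mirror(false);
        if (pos.cell == -cell) return new Mirror(true);
        return null; // no transformation maps this position to pos
    }

    public StripPosition transform(Mirror m) { return new StripPosition(m.flip ? -cell : cell); }
}
```

Every method returns the concrete type of the same game, so client code can chain calls such as position.move(m).transform(t) without a single cast.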

That is a lot of generics, although it still makes perfect sense, and implementing this API results in the right types everywhere. Given a set of implementation types, e.g. ChessMove, ChessPosition, and SquareSymmetry, ChessPosition.transform accepts SquareSymmetry and returns ChessPosition, ChessMove.transform accepts SquareSymmetry and returns ChessMove, and so on. The generics have achieved their goal: no casts are necessary. The cost is that code dealing with the API types must carry around the type parameters. Example:

class Match<P extends Position<P,M,T>, M extends Move<M,T>, T extends Transformation<T>> {
    List<P> positions;
    List<M> moves;
    ...
    public P getCurrentPosition() {
        return positions.get(positions.size()-1);
    }
    public void move(M move) {
        if (!getCurrentPosition().getMoves().contains(move))
            throw new IllegalArgumentException("Invalid move");
        positions.add(getCurrentPosition().move(move));
        moves.add(move);
    }
}

All the classes or functions that reference a position require the generic baggage of "<P extends Position<P,M,T>, M extends Move<M,T>, T extends Transformation<T>>" to be type-safe. Looking at this, I start to doubt whether this is the right thing to do, even though it is technically correct and expresses what I need it to express. The other option I see is not using generics; that would mean casts all over the implementation classes, and nothing preventing users of the API from mixing up types belonging to different games. I like this less.

But the rabbit-hole goes deeper.

Introducing Graphs

I need graphs that store game positions in their nodes. I need two different kinds of nodes: normal nodes and transformation nodes. I can imagine several different graph implementations, so I need an API similar to the game API, and once again the implementation types of the API types will have interdependencies. If I use the same approach of defining type parameters, then I will need the following types:

N: the type of the nodes (parent type)
NN: the type of the normal nodes
TN: the type of the transformation nodes

interface Graph<N extends Node<N,NN,TN>, NN extends NormalNode<N,NN,TN>, TN extends TransformationNode<N,NN,TN>> {
 NN getRoot();
}
interface Node<N extends Node<N,NN,TN>, NN extends NormalNode<N,NN,TN>, TN extends TransformationNode<N,NN,TN>> {
 List<N> getParents();
 NN asNormalNode();
 TN asTransformationNode();
}
interface NormalNode<N extends Node<N,NN,TN>, NN extends NormalNode<N,NN,TN>, TN extends TransformationNode<N,NN,TN>> extends Node<N,NN,TN> {
 List<N> getChildren();
}
interface TransformationNode<N extends Node<N,NN,TN>, NN extends NormalNode<N,NN,TN>, TN extends TransformationNode<N,NN,TN>> extends Node<N,NN,TN> {
 NN getChild();
}

A similar generic baggage (<N extends Node<N,NN,TN>, NN extends NormalNode<N,NN,TN>, TN extends TransformationNode<N,NN,TN>>) defines the implementation types.

This achieves the same thing that I got with the game API, although here I also need to ensure that mutator functions (which are not displayed in this example) cannot be used to mix up nodes of the same type but belonging to different graphs. Of course the type system cannot help with this one, but that's fine.

What's less fine is that a graph is also tied to a game. I am storing positions in the nodes after all; but positions that belong to the same game. This means that to the N-NN-TN type trio, I need to add P-M-T from the game API, making the generic parameter list of the graph types this monstrosity:

<N extends Node<N,NN,TN,P,M,T>,
NN extends NormalNode<N,NN,TN,P,M,T>,
TN extends TransformationNode<N,NN,TN,P,M,T>,
P extends Position<P,M,T>,
M extends Move<M,T>,
T extends Transformation<T>>

And whenever I reference a node of a graph, I'm carrying this list with me. Now this is too much for me.

I don't really have a good option when it comes to avoiding the P-M-T types in the nodes, because I really cannot refer to positions without them very well. What would Node.getPosition return without them? It could return Position<?,?,?>, but then the code that gets this result cannot do anything sensible with it. Without the move type, the compiler will not let me call Position.move, with the argument type being ? extends Move<?,?>. I could use raw types in the clients of the API, but that's something I don't even want to consider.

Right now I consider my best course of action to be avoiding the generic types for the graph nodes: N-NN-TN. This means that the graph types only need to carry P-M-T, but I need to do casts in the graph implementations. Although if I do that, why not remove P-M-T as well, and do casts in the game implementations? If using generics in the game API is something I want to do, then using generics in the graph API is a natural extension. How can I justify using one but not the other? I don't really know what I'll end up doing.

None of the options I can think of seem to be 'the right way'. I'm not even sure if there is one.

Friday, September 28, 2012

GameCracker - 9. Correctness and Trust

When dealing with anything but the simplest game (Tic-Tac-Toe or 4x4 reversi), the program takes up a lot of computing power, disk space, and time. How can I be sure that it even does the right thing, what I intended for it to do?

I can't, of course. All programs, even much simpler than GameCracker, are bound to have bugs. And since I have no way of making sure that GameCracker has none, and I want to set it to work anyway, I just have to suspend my doubts and assume that everything is all right. At least until I have reason to believe otherwise. Because as soon as I have the slightest suspicion that something's wrong, I lose all confidence in the graph I have built so far, even if I'm not sure if it's actually incorrect. (There's no practical way to thoroughly check the data for consistency problems, it would take forever. Maybe not so long as building it, but still...)

For example, in my chess module, I have some test cases that commemorate bugs otherwise long forgotten. One is related to the rule I use for avoiding infinite matches: if a state repeats, then it is a draw. This is a little stricter than the official rules, since real games can contain repeats, but my graph still answers the 'who will win' question just fine if we ignore the moves between the repeating states; those don't matter in any way. So in my model, the chess state consists of not only the board, and not only the extra information about the availability of some special moves (like castling and en passant) that cannot be deduced from the board itself, but also all those past boards that have a chance of repeating. I can clear this 'past states' set whenever something irreversible happens, like when a pawn moves, a piece is captured, or one of the players loses the ability to castle or capture en passant. Otherwise these past states matter: even if two states are the same with regard to the board, the player to move, and all those special moves, if this set is different, I cannot consider the states equivalent. Quite unfortunate, but what can you do. You can list a series of moves that leads to a repeating state from one of them, causing a draw, while the same moves applied in the other one result in a new board setup, and the game goes on. The difference seems slight from a common chess player's point of view, but it is actually huge. That sequence of moves leads to a draw from one state and to a possibly different result from the other, and this can cause the results of these two 'similar' states to be entirely different too. I think I'm overexplaining things, so time to get back to my example.

Long ago I had a bug where I only stored the boards themselves in the 'past states', and that caused my model to call draws when it shouldn't have. It turned out that it very much mattered who the player to move was in those past states. If we find ourselves at a board that we saw before, but now it is the other player's move, then it is not a repeat; it is not a cycle in the graph. So I had to add the 'player to move' information along with the board in the 'past states'. Needless to say, this invalidated the entire graph, and I had to throw it out. I see another test related to the interaction between repeating states and en passant, and a third one making sure that if a pawn moves only one square forward, then it cannot be captured en passant. Bugs like these happen all the time, but once you've found many of them, and time passes, and the program seems to work right, you can grow to trust that there are no more problems, and that the graph and everything is valid and correct. Until you notice something that's not.
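The repetition rule described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the GameCracker chess module: boards are stand-in strings, the special-move flags (castling, en passant) are left out, and the names PastState, GameState, next and isRepetition are hypothetical.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

// A past state is a board together with the player who was to move.
final class PastState {
    final String board;
    final boolean whiteToMove;

    PastState(String board, boolean whiteToMove) {
        this.board = board;
        this.whiteToMove = whiteToMove;
    }

    @Override public boolean equals(Object o) {
        if (!(o instanceof PastState)) return false;
        PastState p = (PastState) o;
        return board.equals(p.board) && whiteToMove == p.whiteToMove;
    }

    @Override public int hashCode() { return Objects.hash(board, whiteToMove); }
}

final class GameState {
    final String board;
    final boolean whiteToMove;
    final Set<PastState> pastStates;

    GameState(String board, boolean whiteToMove, Set<PastState> pastStates) {
        this.board = board;
        this.whiteToMove = whiteToMove;
        this.pastStates = new HashSet<>(pastStates);
    }

    // Applies a move that leads to newBoard. An irreversible move (pawn move,
    // capture, ...) clears the past states, since none of them can repeat.
    GameState next(String newBoard, boolean irreversible) {
        Set<PastState> past = new HashSet<>();
        if (!irreversible) {
            past.addAll(pastStates);
            past.add(new PastState(board, whiteToMove));
        }
        return new GameState(newBoard, !whiteToMove, past);
    }

    // A repeat requires the same board AND the same player to move.
    boolean isRepetition() {
        return pastStates.contains(new PastState(board, whiteToMove));
    }

    // Two states are equivalent only if their past-state sets match as well.
    @Override public boolean equals(Object o) {
        if (!(o instanceof GameState)) return false;
        GameState g = (GameState) o;
        return board.equals(g.board) && whiteToMove == g.whiteToMove
                && pastStates.equals(g.pastStates);
    }

    @Override public int hashCode() { return Objects.hash(board, whiteToMove, pastStates); }
}
```

With this shape, reaching an earlier board with the other player to move is not reported as a repeat, which is exactly the distinction the old bug missed.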

A couple of days ago I wrote a little routine to count how many nodes were in each category. I had started to notice that some category chains were getting long, and it took several seconds to load and go through them. If there were a few chains which were significantly longer than the rest, then maybe a better category algorithm could be introduced which would even out the lengths. So that was the motivation. To gather and count the category lengths, I first had to gather all the categories. With the category head store described in the previous entry, the only way to do this was by going over all the hash tables, all the arrays from beginning to end, and fetching the category number from the non-empty cells. This is quite time-consuming, the category head store file being about 11 GB in size, but memory mapping helps a lot with such things. While I was at it, I counted how many category heads I found in the store; this is the number of categories that I have. Of course this number can also be computed from the metadata of the store (FullBlockCount * MaxBlockSize + LastBlockSize), so I compared these numbers. And they were different.

I have found an inconsistency in the data. This basically means that all the previous work is ruined. The category heads store is incorrect, who knows what else is wrong. Unless I find the cause of the problem, and it turns out that the graph itself is not affected (very unlikely), I have to throw it all away. And this is a bit of a tragedy. Here's why.

In this run, building the chess graph, I have accumulated about 280 million nodes, and the entire data structure is now 110 GB. I didn't measure the total amount of (real) time my computer spent on it, so I don't know the exact number, but it is definitely over 250 days. Welp!

I looked at the code for the category head store long and carefully, to see where I increment the metadata and where I put in new elements (there are no removals, so it is not that complex). This inspection gave me no clues. I then checked the category store for any duplicates. (Since it is a map, it shouldn't have any.) I saw that the category -17 occurred twice. Both occurrences had the value 0 associated with them, which is the invalid value that I use in empty cells. And of course I make sure that the invalid value is not accepted by the store. Thinking that maybe I was incorrectly indexing something in the file, I looked at the cells surrounding both of these invalid cells. The cells immediately after them were empty, and the ones before them had values in them. I loaded up these states and checked if their categories were those that were stored; they were. Some further luck led me to notice, however, that if I take one of these states, navigate up to its parent node, and apply the corresponding move in this parent state, I get a state that is different from what I started with. The board is right, but one has two 'past states', and the other has one. The graph is definitely wrong.

I will have to investigate further and try to find where the bug is. But I also think that I will rewrite the program, trying to build it up better so that I can try out various actual database solutions for storing my graph besides my own approach, and compare which is better. I will publish the source code this time, on GitHub, which I'm pretty sure will lead to better design and documentation. Of course I make absolutely no promises whatsoever.

Update: I have found the bug with the chess states. I had changed how the past states are stored, and I forgot half of the 'addPastState' function... This change was so simple that I didn't even think that I could have made a mistake. Well, let this be a lesson: making mistakes is very easy, and it is worth having more thorough tests.

The issue with the category head store, though, is still a mystery. I've looked at the code for such a long time that it now seems simple. I have to assume that I manipulated the file outside the normal program flow and made a mistake there, or otherwise my hard drive is malfunctioning. Either way, the program is running once again with recreated data, and I have inserted a whole bunch of runtime sanity tests (for example, when iterating over a category chain looking for a state, I check that the category of each state is the same as that of the chain it is in). The minor, pretty much undetectable performance hit is offset by detecting problems much earlier than otherwise. I recommend this to everyone handling sensitive data structures.
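A runtime sanity test of the kind mentioned above might look like the following sketch. It is an in-memory illustration only; the names ChainNode, CategoryChains and find, the use of strings for states, and the convention that node ID 0 means 'no node' are assumptions made for this example.

```java
import java.util.HashMap;
import java.util.Map;

final class ChainNode {
    final long id;
    final long category;
    final String state;
    final long nextInChain; // 0 marks the end of the chain (assumed convention)

    ChainNode(long id, long category, String state, long nextInChain) {
        this.id = id;
        this.category = category;
        this.state = state;
        this.nextInChain = nextInChain;
    }
}

final class CategoryChains {
    private final Map<Long, ChainNode> nodes = new HashMap<>();
    private final Map<Long, Long> heads = new HashMap<>();

    void addNode(ChainNode node) { nodes.put(node.id, node); }
    void setHead(long category, long nodeId) { heads.put(category, nodeId); }

    // Returns the ID of the node holding the state, or 0 if it is not in the chain.
    // While walking the chain, every visited node is checked for consistency.
    long find(long category, String state) {
        Long head = heads.get(category);
        long id = (head == null) ? 0L : head;
        while (id != 0) {
            ChainNode node = nodes.get(id);
            if (node == null) {
                throw new IllegalStateException("Dangling link to node " + id);
            }
            if (node.category != category) {
                throw new IllegalStateException("Node " + id + " has category "
                        + node.category + " but sits in chain " + category);
            }
            if (node.state.equals(state)) {
                return id;
            }
            id = node.nextInChain;
        }
        return 0;
    }
}
```

The check costs one comparison per visited node, and it fails at the first lookup that touches the corrupted part of a chain rather than months later.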

Tuesday, September 25, 2012

GameCracker - 8. Category Heads

When we consider the execution time of the program, the step that can be expected to take the most time is checking to see if a state is already in the graph. That's why I introduced the concept of categories. As a reminder, a category is a hash that limits which states have a possibility to be equal to the new one. For each category, I have a linked list of the nodes which contain states in this category; these category chains are the nodes which I need to check during expansion. I also described that each node contains a link to the next node in the same category chain, so these links are neatly contained in the main data structure. However, I need to know which node is the first in a particular chain. So for the category chains to work, I need an additional data structure that tells me the ID of the first node in a category chain (the category head) given the category. Since both the category itself and the node ID are long values, this data structure is essentially a long-long map.

If the category does a nice job limiting the number of nodes that need inspecting, that means that there are a lot of categories. The time required to look up a category head can become an issue as the number of category chains grows.

With that in mind, I chose a file-based hash table to store the category heads. If I remember correctly, I took a hash map implementation (not the chained java.util.HashMap, but one using open addressing) and adapted it a bit. Here's how it works.

Open addressed hash maps store their elements in arrays; by a hashing function, each key is assigned a position in the array. If the position is already occupied, a free position is found via some probing method. As new elements are added to the map, the array fills up, and collisions become more common; there are more and more elements that require more and more steps to find. This ruins the performance of the data structure. To remedy this, the entire hash map is recreated using a larger array.

This recreation is something that I couldn't fit into my file-based transactional data storage system. Even if I could somehow make it work, it would probably take a lot of time to do in the file system. So I settled on a compromise. Instead of reallocating the entire array, I append a new array at the end of the file. My category head store thus consists of a sequence of these arrays, which I call blocks. The last block is the one to which new elements are added, until it fills up and a new block is appended. The price of this strategy is that when searching for a key, every single block must be examined to see if it contains the key. This is an additional overhead on top of the probing caused by collisions.
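The block strategy can be sketched in memory as follows. This is only an illustration of the idea, not the file-based store: linear probing, the class name BlockedLongMap, its parameters, and overwriting on a repeated key are all assumptions. The value 0 is reserved for empty cells, and the entry count is derived from metadata as FullBlockCount * MaxBlockSize + LastBlockSize.

```java
import java.util.ArrayList;
import java.util.List;

final class BlockedLongMap {
    private static final long EMPTY = 0L; // value 0 marks an empty cell

    private final int blockCapacity; // number of cells in each block
    private final int maxBlockSize;  // entries a block may hold before a new one is appended
    private final List<long[]> keys = new ArrayList<>();
    private final List<long[]> values = new ArrayList<>();
    private int lastBlockSize;

    BlockedLongMap(int blockCapacity, int maxBlockSize) {
        if (maxBlockSize < 1 || maxBlockSize >= blockCapacity) {
            throw new IllegalArgumentException("need 1 <= maxBlockSize < blockCapacity");
        }
        this.blockCapacity = blockCapacity;
        this.maxBlockSize = maxBlockSize;
        appendBlock();
    }

    private void appendBlock() {
        keys.add(new long[blockCapacity]);
        values.add(new long[blockCapacity]);
        lastBlockSize = 0;
    }

    // Linear probing within one block: returns the cell holding the key,
    // or the first empty cell on its probe path. Terminates because every
    // block always keeps at least one empty cell.
    private int probe(int block, long key) {
        long[] k = keys.get(block);
        long[] v = values.get(block);
        int i = Math.floorMod(Long.hashCode(key), blockCapacity);
        while (v[i] != EMPTY && k[i] != key) {
            i = (i + 1) % blockCapacity;
        }
        return i;
    }

    // Returns the stored value, or 0 if the key is absent. Every block is examined.
    long get(long key) {
        for (int b = 0; b < keys.size(); b++) {
            long value = values.get(b)[probe(b, key)];
            if (value != EMPTY) return value;
        }
        return EMPTY;
    }

    void put(long key, long value) {
        if (value == EMPTY) throw new IllegalArgumentException("0 is reserved for empty cells");
        for (int b = 0; b < keys.size(); b++) { // overwrite an existing entry, if any
            int i = probe(b, key);
            if (values.get(b)[i] != EMPTY) {
                values.get(b)[i] = value;
                return;
            }
        }
        if (lastBlockSize == maxBlockSize) appendBlock();
        int last = keys.size() - 1;
        int i = probe(last, key);
        keys.get(last)[i] = key;
        values.get(last)[i] = value;
        lastBlockSize++;
    }

    int blockCount() { return keys.size(); }

    // Entry count from metadata: FullBlockCount * MaxBlockSize + LastBlockSize.
    long size() { return (long) (keys.size() - 1) * maxBlockSize + lastBlockSize; }
}
```

Growth never touches existing blocks, but a lookup for a missing key costs one probe sequence per block, which is the overhead described above.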

This is the data structure that I use now. I'd really love to find a better one though, one with better performance and more pleasant growing abilities. Recently I've been wondering how databases would do this. If I created a table with two 'long' columns, the category (which would be the primary key) and the head ID, how would it perform compared to my hash map? Does it have some tricks that I could learn? One of these days I'll take a closer look.

Sunday, March 25, 2012

Generating a Random Combination

Recently I've had my mind on lottery drawing, and I suddenly thought of a way to generate a random combination that is better than how I used to do it.

Here's the basic problem. Given the set of numbers (0, 1, ... n-1), we want to generate a random subset of k elements of this set of n. The random generator function that I can use is Random.nextInt(n), which returns a number between 0 and n-1. Now, if we have already generated some of the numbers, (c[0], c[1], ..., c[i-1]), then how do we generate the next one?

The way I traditionally do it is that I keep generating a random number between 0 and n-1 until I get one which is different from the ones I already have:

List<Integer> numbers = new ArrayList<Integer>();
while (numbers.size() < k) {
    int next = rnd.nextInt(n);
    if (!numbers.contains(next))
        numbers.add(next);
}

I've kept wondering if there is a way to do it that doesn't involve an unbounded loop, and there is. Think of how lottery numbers are drawn. Once a number is selected, it is discarded from the set of possible numbers, and the next one is then chosen uniformly from the rest. We can do something quite similar in code!

Once again, suppose that we have already generated i numbers, and we're looking for the (i+1)-th one. There are (n-i) candidates left to choose from; the only problem is that they're not the numbers from 0 to n-i-1 (which is the interval in which we can generate one of (n-i) numbers). But we can map these two sets to each other! Let me demonstrate with an example. Consider the numbers between 0 and 9, and suppose we have already chosen 3, 4, and 7. Here's our problem:

The numbers we need to choose from:	0 1 2 5 6 8 9
The numbers we can generate from:	0 1 2 3 4 5 6

Notice that we can collapse the first row, giving us a mapping from the 7 numbers between 0 and 6 to the 7 numbers between 0 and 9 but excluding the ones we've already chosen. The question is: if we generate one in the former set, how can we get the actual number it corresponds to? Here's how:

List<Integer> numbers = new ArrayList<Integer>();
for (int i = 0; i < k; i++) {
    int next = rnd.nextInt(n - i);
    int nextIndex;
    // Skip past every already chosen number that is <= next, shifting next each time.
    for (nextIndex = 0; nextIndex < i && numbers.get(nextIndex) <= next; nextIndex++)
        next++;
    numbers.add(nextIndex, next);
}

Note that now we maintain the list of selected numbers in ascending order. What we basically do is shift our generated number to the right (increment it) as many times as we need to account for the 'empty slots'. For example, suppose that in the scenario depicted above, we get the random number 3. Since the first number already selected (which is 3) is <= the new number, we increment it to get 4. The next number already selected (4) is once again <= our number, so we increment it again, and get 5. The next number in the list is 7, so we don't increment any more. We've found the number among the remaining ones which corresponds to the generated number 3 (or, to put it in another way, the fourth one among the remaining candidates): 5. We similarly need two increments if we start from the number 4. If the generator gives us 5, then after two increments, we find that the next already chosen number, 7, causes us to increment a third time, giving us 8. And so on.

As a nice side-effect, we can notice that the number of times we needed to increment gives us the index where we have to insert the new number to keep the list sorted (since it is equal to how many numbers we already have which are less than this new one).
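The worked example can be checked with a small self-contained version of the loop. The class name RandomCombination and the split into insertDraw and generate are choices made for this sketch; the shifting logic is the same as in the snippet above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

final class RandomCombination {
    // Maps draw (an index among the remaining candidates) to the actual number,
    // inserts it into the ascending list, and returns it.
    static int insertDraw(List<Integer> sorted, int draw) {
        int next = draw;
        int index = 0;
        // Each already chosen number that is <= next pushes next one step right.
        while (index < sorted.size() && sorted.get(index) <= next) {
            next++;
            index++;
        }
        sorted.add(index, next); // index equals the count of smaller chosen numbers
        return next;
    }

    // Returns k distinct numbers from 0..n-1 in ascending order.
    static List<Integer> generate(int n, int k, Random rnd) {
        if (k < 0 || k > n) {
            throw new IllegalArgumentException("need 0 <= k <= n");
        }
        List<Integer> numbers = new ArrayList<>();
        for (int i = 0; i < k; i++) {
            insertDraw(numbers, rnd.nextInt(n - i));
        }
        return numbers;
    }
}
```

With 3, 4 and 7 already chosen, insertDraw maps the draws 3, 4 and 5 to the numbers 5, 6 and 8 respectively, matching the walkthrough.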

I like this algorithm much more than the original one. We can give an upper limit to its execution time, so we don't run into problems even when k is close to n. I'm pretty sure that I'm not the first one to think of this way of generating combinations, but I wonder how come I've never seen this method before.

P.S. There's another, quite pretty approach on Wikipedia. That one is similarly easy to implement and has a bounded run time, and I had not heard of it before either.