Friday, March 22, 2013

Parallel Hierarchies

Back in the day I mused about how generics can get out of control if you have interrelated types in an API and you use generics to enforce type-safety. The Java type system does not handle that situation gracefully. This time I'll describe a different kind of problem where the language let me down a little bit.

We still have types in an API that will be implemented, but now, rather than type interrelations, the focus will be on type hierarchies. Although I could once again turn to the Node types in GameCracker for illustration, I will mix it up a bit and demonstrate my difficulties using my Matrix library.

The Matrix types

The gimmick of this library is that I wanted a type-rich linear algebra package; apart from the general Matrix type, I wanted a subtype for Vectors (1-column matrices), 3x3 matrices, 3-element vectors etc. These specialized types can define methods appropriate for that particular kind of matrix (for example, 3D vectors have a cross product function), and, this being the primary reason, common operations like addition and transposition can be redefined in subtypes to return a more specific type. Just in case I'm talking nonsense, here is a hopefully helpful example, focusing on only two types and just a couple of methods for the sake of simplicity:
public interface Matrix {
    /** Returns the number of columns. */
    public int getColumns();

    /** Returns the number of rows. */
    public int getRows();


    /** Returns the (row,col) element. */
    public double get(int row, int col);


    /** Adds this matrix and m, and returns the result. */
    public Matrix plus(Matrix m);
}


public interface Vector extends Matrix {

    /** Returns the number of components. */
    public int getDimension();

    /** Returns the i-th component. */
    public double getCoord(int i);

    /** Adds this vector and m, and returns the result. */
    public Vector plus(Matrix m);

    /** Computes the dot product of this vector and v. */
    public double dot(Vector v);
}
Technically everything could be done with the single Matrix type, but being able to use and work with subtypes helps readability a lot when doing math. Readability is really, really important. Working with 3D points is so much easier when you have a List<Vector3> rather than a List<Matrix>, not to mention that the Vector3 type can have very nice getX(), getY(), getZ() functions.

Implementing the Types

There are a number of possible implementations for the Matrix API. The most straightforward is one that uses an array to store the elements. Note that the implementation must provide concrete classes for every type: it must have 2D vectors, 3D vectors, vectors that are neither, and matrices that are not vectors. The parent types are not just there to define common functionality for the subtypes.

public class ArrayMatrix implements Matrix {
    private double[][] data;
    @Override
    public int getColumns() {
        return data[0].length;
    }
    @Override
    public int getRows() {
        return data.length;
    }
    @Override
    public double get(int row, int col) {
        return data[row][col];
    }
    @Override
    public Matrix plus(Matrix m) {
        if (m.getRows()!=getRows() || m.getColumns()!=getColumns())
            throw new IllegalArgumentException("size mismatch");
        Matrix result=MatrixFactory.create(m.getRows(), m.getColumns());
        for (int row=0; row<getRows(); row++)
            for (int col=0; col<getColumns(); col++)
                result.set(row, col, get(row, col)+m.get(row, col));
        return result;
    }
}
public class ArrayVector extends ArrayMatrix implements Vector {
    @Override
    public int getDimension() {
        return getRows();
    }
    @Override
    public double getCoord(int i) {
        return get(i, 0);
    }
    @Override
    public Vector plus(Matrix m) {
        return (Vector)super.plus(m);
    }
    @Override
    public double dot(Vector v) {
        if (getDimension()!=v.getDimension())
            throw new IllegalArgumentException("size mismatch");
        double result=0;
        for (int i=0; i<getDimension(); i++)
            result+=getCoord(i)*v.getCoord(i);
        return result;
    }


}
A couple of disclaimers before I proceed.
  1. First, this does not exactly reflect what I have in my library in terms of names and such. My goal is not to be faithful to the source material, but to provide a simple example.
  2. Second, I just omit constructors and functions like Matrix.set, hoping that no-one will seriously miss them during this discussion.
  3. Third, about the creation of the matrix instances. As part of the gimmick, I want every matrix that is created through operations throughout the library to have the most specific type. For example, if a matrix has only one column, then it needs to be a Vector, so that the user can cast it to one and take advantage of the subtype. Since the API is public, I have no way of enforcing this, but I have a MatrixFactory class that does take care of this constraint, and everywhere in my library I use this factory whenever I need an instance.
  4. All the different kinds of matrix implementations have their own factory because of this, but I wanted the general API functions that create new instances (like Matrix.plus) to create instances of the default implementation, rather than of the same implementation as the 'creator'. The reason is to avoid surprises. Imagine that you get two matrices, add them together, try to modify the result, and get an exception because the first matrix was actually an ImmutableMatrix, and its plus method also returned an ImmutableMatrix. That would make using the library rather unpleasant; that's why whenever a matrix needs to be created inside the library, the default MatrixFactory is used.
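The factory described in point 3 could be sketched roughly like this. This is only a minimal illustration: the type stubs and the dispatch logic are hypothetical; only the 'most specific type' rule comes from the post.

```java
// Hypothetical sketch of the factory described above: it always returns
// the most specific default implementation for the requested dimensions.
interface Matrix { int getRows(); int getColumns(); }
interface Vector extends Matrix {}
interface Vector3 extends Vector {}

class ArrayMatrix implements Matrix {
    final double[][] data;
    ArrayMatrix(int rows, int cols) { data = new double[rows][cols]; }
    public int getRows() { return data.length; }
    public int getColumns() { return data[0].length; }
}
class ArrayVector extends ArrayMatrix implements Vector {
    ArrayVector(int rows, int cols) { super(rows, cols); }
}
class ArrayVector3 extends ArrayVector implements Vector3 {
    ArrayVector3(int rows, int cols) { super(rows, cols); }
}

class MatrixFactory {
    // Dispatch on dimensions so callers can safely downcast the result.
    static Matrix create(int rows, int cols) {
        if (cols == 1 && rows == 3) return new ArrayVector3(rows, cols);
        if (cols == 1) return new ArrayVector(rows, cols);
        return new ArrayMatrix(rows, cols);
    }
}
```

The important point is the contract, not the if-chain: every matrix created inside the library goes through this single method, so a 1-column result is always castable to Vector.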
Okay, with all this marginally important stuff out of the way, let me direct your attention to the point. Have a look at the ArrayMatrix.plus function. It doesn't actually contain anything specific to this particular matrix implementation; in fact, this code works for any imaginable implementation. The same is true for the functions in ArrayVector, and in fact for the whole bunch of omitted functions (multiply, submatrix, dot product, matrix norm etc.). Of course it is possible that some special kind of matrix can provide a more efficient implementation for some of these, but they are always a reasonable default. All functions apart from the core implementation-specific Matrix.get, Matrix.set, Matrix.getColumns, and Matrix.getRows will be shared by most subtypes of Matrix.

API as Abstract Classes


To share code with subtypes, you stick that code into a common supertype. Usually. To do that here, we need the API types to be abstract classes rather than interfaces:

public abstract class Matrix {

    /** Returns the number of columns. */
    public abstract int getColumns();

    /** Returns the number of rows. */
    public abstract int getRows();

    /** Returns the (row,col) element. */
    public abstract double get(int row, int col);

    /** Adds this matrix and m, and returns the result. */
    public Matrix plus(Matrix m) {
        if (m.getRows()!=getRows() || m.getColumns()!=getColumns())
            throw new IllegalArgumentException("size mismatch");
        Matrix result=MatrixFactory.create(m.getRows(), m.getColumns());
        for (int row=0; row<getRows(); row++)
            for (int col=0; col<getColumns(); col++)
                result.set(row, col, get(row, col)+m.get(row, col));
        return result;
    }
   
    // other non-abstract functions
}

public abstract class Vector extends Matrix {
    public int getDimension() {
        return getRows();
    }
    public double getCoord(int i) {
        return get(i, 0);
    }
    @Override
    public Vector plus(Matrix m) {
        return (Vector)super.plus(m);
    }
    public double dot(Vector v) {
        if (getDimension()!=v.getDimension())
            throw new IllegalArgumentException("size mismatch");
        double result=0;
        for (int i=0; i<getDimension(); i++)
            result+=getCoord(i)*v.getCoord(i);
        return result;
    }

}
That's all well and good: ArrayMatrix can inherit the matrix operations from Matrix, and ArrayVector can inherit the vector operations from Vector.

public class ArrayMatrix extends Matrix {
    private double[][] data;
    @Override
    public int getColumns() {
        return data[0].length;
    }
    @Override
    public int getRows() {
        return data.length;
    }
    @Override
    public double get(int row, int col) {
        return data[row][col];
    }
   
    // all the other functions we'll just inherit

}
public class ArrayVector extends Vector {
    // aww, we no longer extend ArrayMatrix

    private double[][] data;
    @Override
    public int getColumns() {
        return data[0].length;
    }
    @Override
    public int getRows() {
        return data.length;
    }
    @Override
    public double get(int row, int col) {
        return data[row][col];
    }

    // all the other functions inherited from Vector

}
Now ArrayVector can no longer inherit from ArrayMatrix, since it already has a superclass. So rather than duplicating the 'other' functions among the different implementations, we need to duplicate the implementation-specific 'core' functions and data structures among the different types of that implementation. And yes, we could use this opportunity to implement ArrayVector using a simple double[] rather than a double[][], but we had this option in the previous version too; what we have now lost is the option to share the code.
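As an illustration of that option, a double[]-backed vector against the abstract-class API could look something like this. This is a sketch with trimmed-down stand-ins for the abstract classes; only the flat-array idea comes from the post.

```java
// Trimmed stand-ins for the abstract-class API described above.
abstract class Matrix {
    public abstract int getRows();
    public abstract int getColumns();
    public abstract double get(int row, int col);
}
abstract class Vector extends Matrix {
    public int getDimension() { return getRows(); }
    public double getCoord(int i) { return get(i, 0); }
}

// A vector backed by a flat double[] instead of a double[][]: it only
// needs to supply the core accessors, everything else is inherited.
class ArrayVector extends Vector {
    private final double[] data;
    ArrayVector(double... data) { this.data = data; }
    public int getRows() { return data.length; }
    public int getColumns() { return 1; }
    public double get(int row, int col) { return data[row]; }
}
```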

This looks like a trade-off, and quite a favourable one too (there are a lot of functions with default implementations), but I have a very strong counterargument: I want ImmutableVector to be an ImmutableMatrix. The immutable implementation is a bit of a special case in that it comes with the guarantee that its matrix elements will never change. Users of the library rely on this guarantee by using the Immutable* types (in contrast to the other implementations like Array*, which don't even need to be public). An ImmutableVector that is not an ImmutableMatrix just doesn't make any sense.

This counterargument doesn't actually seem as strong as I originally thought, but anyway... We tried to avoid code duplication, but didn't really succeed. In Java, ArrayVector cannot inherit common Vector behaviour from the API while also inheriting common ArrayMatrix behaviour: no multiple inheritance allowed. I opted to keep the type hierarchy in the implementation (which means the API needs to consist of interfaces), and moved the common functions to static helper methods.

Static Helper Class

Now I have something like this:
public class MatrixOp {
    public static Matrix plus(Matrix m1, Matrix m2) {
        // add the matrices, return the result
    }
    public static double dot(Vector v1, Vector v2) {
        // ...
    }
    // other similar functions implementing the common operations
}


public class ArrayMatrix implements Matrix {
    private double[][] data;
    @Override
    public int getColumns() {
        return data[0].length;
    }
    @Override
    public int getRows() {
        return data.length;
    }
    @Override
    public double get(int row, int col) {
        return data[row][col];
    }
    @Override
    public Matrix plus(Matrix m) {
        return MatrixOp.plus(this, m);
    }
}
public class ArrayVector extends ArrayMatrix implements Vector {
    @Override
    public int getDimension() {
        return getRows();
    }
    @Override
    public double getCoord(int i) {
        return get(i, 0);
    }
    @Override
    public Vector plus(Matrix m) {
        return (Vector)super.plus(m);
    }
    @Override
    public double dot(Vector v) {
        return MatrixOp.dot(this, v);
    }

}
This moves the implementations of the functions to a single place, but the functions themselves still need to be present in the concrete implementation types to make the delegating call. (And when the implementation is a single line, I don't even bother with delegating.) That's still quite a bit of repeated code.

You could propose just removing all the fancy operations from the main types and making them available to the user as static functions in the helper class. While that makes sense from a design perspective, and it gets rid of all code duplication, I still have to say no. Having all the functions on the matrix types themselves is really important to me. In the name of readability.

I am eagerly waiting for default methods, coming in Java 8, that will make the ideal implementation possible.
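To make the wait concrete, here is roughly what that ideal version could look like with default methods. This is a sketch, not the library's actual code: it assumes Matrix.set (omitted in the listings above) and a simplified default MatrixFactory as described in the disclaimers.

```java
// Sketch of the Matrix API with Java 8 default methods: the shared plus
// implementation lives in the interface itself, so ArrayVector can extend
// ArrayMatrix while both inherit the common operations from the API.
interface Matrix {
    int getRows();
    int getColumns();
    double get(int row, int col);
    void set(int row, int col, double value); // omitted in the post, assumed here

    default Matrix plus(Matrix m) {
        if (m.getRows() != getRows() || m.getColumns() != getColumns())
            throw new IllegalArgumentException("size mismatch");
        Matrix result = MatrixFactory.create(getRows(), getColumns());
        for (int row = 0; row < getRows(); row++)
            for (int col = 0; col < getColumns(); col++)
                result.set(row, col, get(row, col) + m.get(row, col));
        return result;
    }
}

interface Vector extends Matrix {
    default int getDimension() { return getRows(); }
    default double getCoord(int i) { return get(i, 0); }

    @Override
    default Vector plus(Matrix m) { return (Vector) Matrix.super.plus(m); }

    default double dot(Vector v) {
        if (getDimension() != v.getDimension())
            throw new IllegalArgumentException("size mismatch");
        double result = 0;
        for (int i = 0; i < getDimension(); i++)
            result += getCoord(i) * v.getCoord(i);
        return result;
    }
}

class MatrixFactory { // simplified stand-in for the default factory
    static Matrix create(int rows, int cols) {
        return cols == 1 ? new ArrayVector(rows) : new ArrayMatrix(rows, cols);
    }
}

class ArrayMatrix implements Matrix {
    final double[][] data;
    ArrayMatrix(int rows, int cols) { data = new double[rows][cols]; }
    public int getRows() { return data.length; }
    public int getColumns() { return data[0].length; }
    public double get(int r, int c) { return data[r][c]; }
    public void set(int r, int c, double v) { data[r][c] = v; }
}

// The concrete types only supply the core accessors and the data structure.
class ArrayVector extends ArrayMatrix implements Vector {
    ArrayVector(int dim) { super(dim, 1); }
}
```

Note the Matrix.super.plus(m) call: the Vector interface narrows the return type of plus while reusing the inherited default body, which is exactly the covariant-override pattern the abstract-class version needed.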

Friday, January 11, 2013

Designing GameCracker - Working Without Nodes

I've suddenly had an idea. What if the Graph API didn't specify nodes at all? Of course, it is nodes that make up the graph, but are they really necessary for using it? The answer in general is probably yes, but in GameCracker, the interactions with a graph are rather limited. I just might be able to turn nodes, normal nodes, and transformation nodes into implementation details. That would be glorious! It would make both implementing graphs and using them much simpler.

The idea comes from a handy utility class that I've written called Match. It contains a sequence of valid moves, the positions along this path, and the corresponding nodes in the graph. It has two basic operations:
  1. move: in the current position (the last one in the sequence), make a given move, expanding the graph if the move didn't exist in it yet, and append this move and the new position to the match sequence.
  2. back: undo the last move, update the match to reflect the previous position.
The graph is built and expanded by traversing it using the move operation. The back operation is used to backtrack when we encounter a position where the result is known. Now, how exactly the graph is traversed can be anything from depth-first to externally controlled (like using a chess program to try to select a winning move), but these two operations are pretty much all we need.
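The two operations can be sketched like this. This is a toy version using a plain int in place of a real game position; the real Match also tracks the graph nodes, expands the graph, and hides transformation nodes.

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch of the two Match operations described above. A position is
// modeled as an int and 'making a move' as adding to it, purely so the
// move/back bookkeeping is visible without a full game implementation.
class Match {
    private final List<Integer> positions = new ArrayList<>();
    private final List<Integer> moves = new ArrayList<>();

    Match(int initialPosition) { positions.add(initialPosition); }

    int getCurrentPosition() { return positions.get(positions.size() - 1); }

    // move: apply a move in the current position, append both the move
    // and the resulting position to the match sequence.
    void move(int move) {
        moves.add(move);
        positions.add(getCurrentPosition() + move);
    }

    // back: undo the last move, restoring the previous position.
    void back() {
        moves.remove(moves.size() - 1);
        positions.remove(positions.size() - 1);
    }
}
```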

Besides the nice abstraction that Match provides, it also deals with the fact that sometimes there are transformation nodes in the graph, and hides this from the 'user', so the state graph can be explored as if there were no transformation nodes. All in all, introducing this class made everything much cleaner and nicer.

And now my idea is: if I have Match, do I even need the nodes? It seems very likely to me that I don't! The main use case (described above) is handled by the Match operations very comfortably. The big question is whether there are any other ways it might be useful to interact with the game graph. If all of these interactions can be solved using Match instead of nodes, then I'm a happy camper.

The only major thing I can think of right now is the 'local search'. The match is at a certain point, and I'm waiting for the chess program to suggest to me which is the winning move, so that I go that way instead of exploring bad moves. While I'm waiting for an answer, I do a search on the graph to see if I can find a good move myself. It can happen, for example, that I'm at a position P, and there is a move M that leads to a position Q which I've already solved. The graph already contains a node with position Q and the result 'White wins', but this position had been reached a different way, and the edge between P and Q is not in the graph yet, so the result of P is still unknown. If I were to ask the graph to add move M, then it would find this Q node, and suddenly P would be solved.

It is very possible that the chess program considers a different move stronger and would guide me in that direction. It can very well be that this move M is something that a player would never make; but following the 'best' route would involve adding more (possibly much, much more) nodes and positions to the graph unnecessarily. (Unnecessary with regards to proving the result of P.)

It's like that joke about boiling water. You ask an engineer and a mathematician to boil some water in a tea pot, which you place on the table. The engineer picks up the pot, fills it with water, places it on the stove, and turns the flame on. The mathematician does the same. Next, you put a pot already filled with water on the table. The engineer picks it up, places it on the stove, and turns the flame on. The mathematician pours out the water and puts the empty pot back on the table, saying that now we have the original problem, which has already been solved. In GameCracker, we want to be the mathematician: to solve positions with minimal work rather than finding 'optimal' solutions, to try to keep the state graph as small as possible.

I might have been sidetracked a bit. Anyway, this local search requires the graph to be able to answer the question: here's this position, what do you know about its result? Fortunately the concept of nodes is not necessary for this operation either. And I cannot think of anything else that would require code outside the graph implementation to be able to refer to individual nodes.

To sum up, I don't see any reason why this change would be a bad idea, but I see quite a number of advantages. I'm quite excited about this! Suddenly all the problems I talked about previously, as well as some other issues with nodes, are mitigated, and with a simpler API. All that was required was to question whether the concept of nodes, such an important part of graphs, was actually necessary to expose.

Friday, November 23, 2012

Interrelated Generic Types

When designing software, after a varying amount of thought and work, I can usually arrive at a solution that seems elegant, reliable, I dare say beautiful. The structure of the class hierarchy is very straightforward and closely mirrors the mental image of the problem, making it easy to understand and handle. There are some cases, however, where I seem to be stuck with something, and I suspect that it could be done better, but I don't know how. Maybe I am not intelligent enough to find the 'proper' solution, or maybe the language simply lets me down. I have two concrete examples in mind, and in this entry I present the one involving generics.

The example comes from my GameCracker project, but you don't need to be familiar with it to understand the problem. I'll present everything in as much detail as I deem necessary to understand what I want to convey.

Designing a Game API


For modeling a game, I need a type that represents the complete state of the game at a moment in time. I call this type Position. In each position, there may be actions that can be performed; these are the Moves. The Position specifies the set of possible Moves, and performing one of these moves leads to a different Position. In code, it looks like this:

interface Move {}

interface Position {
    List<Move> getMoves();
    Position move(Move move);
}
This is the API, and there are implementations for different games. For example, ChessMove and ChessPosition, or TicTacToeMove and TicTacToePosition.

The types that the API defines are implicitly interrelated; the classes of a single implementation belong together. For example, ChessPosition.getMoves will always return objects of type ChessMove, and it makes no sense to try to apply a TicTacToeMove to a ChessPosition. These kinds of constraints can be enforced using generics:

interface Move {}

interface Position<M extends Move> {
    List<M> getMoves();
    Position<M> move(M move);
}
And the implementation:
class ChessMove implements Move {...}

class ChessPosition implements Position<ChessMove> {
    public List<ChessMove> getMoves() {...}
    public Position<ChessMove> move(ChessMove move) {...}
}

More Types


That looks fine. What would be even prettier is if the move function returned ChessPosition; nothing stops us from overriding the move function with this more concrete return type, but to enforce this constraint through the type system, we need a type variable for the implementation type of Position.

There are other reasons to have this type variable. In a game, there can be transformations between positions, for example when two positions are the same except that the board is mirrored. So there is a function that compares positions, and another one that applies a transformation to a position. Similarly to how only ChessMoves should be applied to ChessPositions, it makes no sense to try to compare positions of different games.

What the 'transformations' actually mean depends on the game itself; they are perhaps not simple symmetries. So this is also a type that is associated with the game implementation. Putting it all together, we have:
interface Transformation {}

interface Move {}

interface Position<P extends Position<P,M,T>, M extends Move, T extends Transformation> {
    List<M> getMoves();
    P move(M move);
    T getTransformationTo(P pos);
    P transform(T transformation);
}
Now we have three different type variables, but it all makes sense. For a given game, P is the type of the game's positions, M is the type of the game's moves, and T is the type of the transformations that make sense for the game. Now, it turns out that I need some other functions: I need to be able to apply transformations to moves, and I need to compose transformations. All this with the type constraints that a game's move can only be transformed by the game's transformation, and only transformations of the same kind can be composed. With these additions we arrive at the current API used in GameCracker:
interface Transformation<T extends Transformation<T>> {
    T compose(T other);
}
interface Move<M extends Move<M,T>, T extends Transformation<T>> {
    M transform(T transformation);
}
interface Position<P extends Position<P,M,T>, M extends Move<M,T>, T extends Transformation<T>> {
    List<M> getMoves();
    P move(M move);
    T getTransformationTo(P pos);
    P transform(T transformation);
}
Those are a lot of generics, although it all still makes perfect sense, and implementing this API results in the right types everywhere. Given a set of implementation types, e.g. ChessMove, ChessPosition, and SquareSymmetry, ChessPosition.transform accepts SquareSymmetry and returns ChessPosition, ChessMove.transform accepts SquareSymmetry and returns ChessMove, etc. The generics have achieved their goal: no casts are necessary. The cost is that code dealing with the API types must carry the type parameters around. Example:

class Match<P extends Position<P,M,T>, M extends Move<M,T>, T extends Transformation<T>> {
    List<P> positions;
    List<M> moves;
    ...
    public P getCurrentPosition() {
        return positions.get(positions.size()-1);
    }
    public void move(M move) {
        if (!getCurrentPosition().getMoves().contains(move))
            throw new IllegalArgumentException("Invalid move");
        positions.add(getCurrentPosition().move(move));
        moves.add(move);
    }
}
All the classes and functions that reference a position require the generic baggage of "<P extends Position<P,M,T>, M extends Move<M,T>, T extends Transformation<T>>" to be type-safe. Looking at this, I am starting to doubt whether this is the right thing to do, even though it is technically correct and expresses what I need it to express. The other option I see is not using generics; that would mean casts all over the implementation classes, and nothing preventing users of the API from mixing up types belonging to different games. I like this even less.
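For comparison, the cast-based alternative would look something like this. This is a sketch; the stub bodies are only there to make the failure mode visible.

```java
import java.util.List;

// Sketch of the non-generic alternative: the API loses the type links, so
// implementations must cast, and nothing stops a caller from handing a
// foreign game's move to a ChessPosition; the failure happens at runtime.
interface Move {}

interface Position {
    List<Move> getMoves();
    Position move(Move move);
}

class ChessMove implements Move {}

class ChessPosition implements Position {
    public List<Move> getMoves() {
        return List.of(new ChessMove()); // stub move list
    }
    public Position move(Move move) {
        ChessMove m = (ChessMove) move; // cast required by the implementation
        return new ChessPosition();     // stub successor position
    }
}
```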

But the rabbit-hole goes deeper.

Introducing Graphs


I need graphs that store game positions in their nodes. I need two different kinds of nodes: normal nodes and transformation nodes. I can imagine several different graph implementations, so I need an API similar to the game API, and once again the implementation types of the API types will have interdependencies. If I use the same approach of defining type parameters, then I will need the following types:
    • N: the type of the nodes (parent type)
    • NN: the type of the normal nodes
    • TN: the type of the transformation nodes

interface Graph<N extends Node<N,NN,TN>, NN extends NormalNode<N,NN,TN>, TN extends TransformationNode<N,NN,TN>> {
    NN getRoot();
}
interface Node<N extends Node<N,NN,TN>, NN extends NormalNode<N,NN,TN>, TN extends TransformationNode<N,NN,TN>> {
    List<N> getParents();
    NN asNormalNode();
    TN asTransformationNode();
}
interface NormalNode<N extends Node<N,NN,TN>, NN extends NormalNode<N,NN,TN>, TN extends TransformationNode<N,NN,TN>> extends Node<N,NN,TN> {
    List<N> getChildren();
}
interface TransformationNode<N extends Node<N,NN,TN>, NN extends NormalNode<N,NN,TN>, TN extends TransformationNode<N,NN,TN>> extends Node<N,NN,TN> {
    NN getChild();
}
A similar pile of generic baggage (<N extends Node<N,NN,TN>, NN extends NormalNode<N,NN,TN>, TN extends TransformationNode<N,NN,TN>>) is needed to define the implementation types.

This achieves the same thing I achieved with the game API, although here I also need to ensure that mutator functions (not shown in this example) cannot be used to mix up nodes of the same type that belong to different graphs. Of course the type system cannot help with this one, but that's fine.

What's less fine is that a graph is also tied to a game. I am storing positions in the nodes after all, and all positions in one graph belong to the same game. This means that to the N-NN-TN type trio I need to add P-M-T from the game API, making the generic parameter list of the graph types this monstrosity:

<N extends Node<N,NN,TN,P,M,T>,
NN extends NormalNode<N,NN,TN,P,M,T>,
TN extends TransformationNode<N,NN,TN,P,M,T>,
P extends Position<P,M,T>,
M extends Move<M,T>,
T extends Transformation<T>>

And whenever I reference a node of a graph, I'm carrying this list with me. Now this is too much for me.

I don't really have a good option when it comes to avoiding the P-M-T types in the nodes, because I cannot really refer to positions without them. What would Node.getPosition return without them? It could return Position<?,?,?>, but then the code that gets this result cannot do anything sensible with it: without the move type, the compiler will not let me call Position.move, since the argument type would be ? extends Move<?,?>. I could use raw types in the clients of the API, but that's something I don't even want to consider.
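To illustrate the wildcard problem (a sketch with hypothetical game types; the illegal call is left commented out):

```java
import java.util.List;

// With Position<?,?,?> the compiler cannot prove that any particular Move
// is acceptable, while a generic 'capture helper', where the type names
// are bound again, makes essentially the same call compile.
interface Transformation<T extends Transformation<T>> { T compose(T other); }
interface Move<M extends Move<M,T>, T extends Transformation<T>> { M transform(T t); }
interface Position<P extends Position<P,M,T>, M extends Move<M,T>, T extends Transformation<T>> {
    List<M> getMoves();
    P move(M move);
}

class WildcardDemo {
    // Inside this method the wildcard is named M again, so the result of
    // getMoves() and the parameter of move() line up, and the call compiles.
    static <P extends Position<P,M,T>, M extends Move<M,T>, T extends Transformation<T>>
    P moveFirst(Position<P,M,T> pos) {
        return pos.move(pos.getMoves().get(0));
    }

    static void wontCompile(Position<?,?,?> pos) {
        // pos.move(pos.getMoves().get(0)); // error: move(...) is not applicable
    }
}

// A minimal concrete game (hypothetical names) to exercise the helper.
class Identity implements Transformation<Identity> {
    public Identity compose(Identity other) { return this; }
}
class CounterMove implements Move<CounterMove, Identity> {
    public CounterMove transform(Identity t) { return this; }
}
class CounterPosition implements Position<CounterPosition, CounterMove, Identity> {
    public List<CounterMove> getMoves() { return List.of(new CounterMove()); }
    public CounterPosition move(CounterMove m) { return new CounterPosition(); }
}
```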

Right now I consider my best course of action to be avoiding the generic types for the graph nodes: N-NN-TN. This means that the graph types only need to carry P-M-T, but I need to do casts in the graph implementations. Although if I do that, why not remove P-M-T as well, and do casts in the game implementations? If using generics in the game API is something I want to do, then using generics in the graph API is a natural extension. How can I justify using one but not the other? I don't really know what I'll end up doing.

None of the options I can think of seem to be 'the right way'. I'm not even sure if there is one.

Friday, September 28, 2012

GameCracker - 9. Correctness and Trust

When dealing with anything but the simplest games (Tic-Tac-Toe or 4x4 reversi), the program takes up a lot of computing power, disk space, and time. How can I be sure that it even does the right thing, the thing I intended it to do?

I can't, of course. All programs, even ones much simpler than GameCracker, are bound to have bugs. And since I have no way of making sure that GameCracker has none, and I want to set it to work anyway, I just have to suspend my doubts and assume that everything is all right. At least until I have reason to believe otherwise. Because as soon as I have the slightest suspicion that something's wrong, I lose all confidence in the graph I have built so far, even if I'm not sure it's actually incorrect. (There's no practical way to thoroughly check the data for consistency problems; it would take forever. Maybe not as long as building it, but still...)

For example, in my chess module, I have some test cases that commemorate bugs otherwise long forgotten. One is related to the rule I use for avoiding infinite matches: if a state repeats, then the game is a draw. This is a little stricter than the official rules; real games can have repeats, but my graph still answers the 'who will win' question just fine if we ignore the moves between the repeating states, as those don't matter in any way. So in my model, the chess state comprises not only the board, and not only the extra information about the availability of some special moves (like castling and en passant) that cannot be deduced from the board itself, but also all those past boards that have a chance of repeating. I can clear this 'past states' set whenever something irreversible happens, like a pawn moving, a piece being captured, or one of the players losing the ability to castle or capture en passant; but otherwise these states matter. Even if two states are the same with regards to the board, the player to move, and all those special moves, if this set is different, I cannot consider the states equivalent. Quite unfortunate, but what can you do. You can list a series of moves that leads to a repeating state from one of them, causing a draw, while these same moves applied in the other result in a new board setup, and the game goes on. The difference seems slight from a chess player's point of view, but it is actually huge: that sequence of moves leads to a draw from one state and to a possibly different result from the other, and this can cause the results of these two 'similar' states to be entirely different too. I think I'm overexplaining things, so time to get back to my example.

Long ago I had a bug where I only stored the boards themselves as the 'past states', and that caused my model to call draws when it shouldn't have. It turned out that it very much mattered who the player to move was in those past states. If we find ourselves at a board that we saw before, but now it is the other player's move, then it is not a repeat; it is not a cycle in the graph. So I had to store the 'player to move' information along with the board in the 'past states'. Needless to say, this invalidated the entire graph, and I had to throw it out. I see another test related to the interaction between repeating states and en passant, and a third one making sure that if a pawn moves only one square forward, then it cannot be captured en passant. Bugs like these happen all the time, but once you've found many of them, and time passes, and the program seems to work right, you can grow to trust that there are no more problems, and that the graph and everything else is valid and correct. Until you notice something that's not.

A couple of days ago I wrote a little routine to count how many nodes were in each category. I had started to notice that some category chains were getting long, and it took several seconds to load and go through them. If there were a few chains significantly longer than the rest, then maybe a better category algorithm could be introduced to even out the lengths. So that was the motivation. To gather and count the category lengths, I first had to gather all the categories. With the category head store described in the previous entry, the only way to do this was by going over all the hash tables, all the arrays from beginning to end, and fetching the category number from the non-empty cells. This is actually quite time-consuming, the category head store file being about 11 GB in size, but memory mapping helps a lot with such things. While I was doing this, I also counted how many category heads I had found in the store; this is the number of categories I have. Of course this number can also be computed from the metadata of the store (FullBlockCount * MaxBlockSize + LastBlockSize), so I compared the two numbers. And they were different.

I have found an inconsistency in the data. This basically means that all the previous work is ruined. The category head store is incorrect, and who knows what else is wrong. Unless I find the cause of the problem, and it turns out that the graph itself is not affected (very unlikely), I have to throw it all away. And that is a bit of a tragedy. Here's why.

In this run of building the chess graph, I have accumulated about 280 million nodes; the entire data structure is now 110 GB, and the total amount of real time my computer has spent on it, though I didn't measure it so I don't know the exact number, is definitely over 250 days. Welp!

I looked at the code for the category head store long and carefully, to see where I increment the metadata and where I put in new elements (there are no removals, so it is not that complex). This inspection gave me no clues. I then checked the category store for duplicates. (Since it is a map, it shouldn't have any.) I saw that the category -17 occurred twice. Both occurrences had the value 0 associated with them, which is the invalid value that I use in empty cells. And of course I make sure that the invalid value is not accepted by the store. Thinking that maybe I was indexing something in the file incorrectly, I looked at the cells surrounding these two invalid cells. The cells immediately after them were empty, and the ones before them had values in them. I loaded up those states and checked whether their categories matched the stored ones; they did. Some further luck led me to notice, however, that if I take one of these states, navigate up to its parent node, and apply the corresponding move in that parent state, I get a state that is different from what I started with. The board is right, but one has two 'past states' and the other has one. The graph is definitely wrong.

I will have to investigate further and try to find where the bug is. But I also think I will rewrite the program, building it up better so that I can try various actual database solutions for storing my graph alongside my own approach, and compare which is better. I will publish the source code this time, on GitHub, which I'm pretty sure will lead to better design and documentation. Of course I make absolutely no promises whatsoever.

Update: I have found the bug with the chess states. I had changed how the past states are stored, and I forgot to update half of the 'addPastState' function... The change was so simple that I didn't even think I could have made a mistake. Well, let this be a lesson: making mistakes is very easy, and it is worth having more thorough tests.

The issue with the category head store, though, is still a mystery. I've looked at the code for so long that it now seems simple. I have to assume that I manipulated the file outside the normal program flow and made a mistake there, or else my hard drive is malfunctioning. Either way, the program is running once again with recreated data, and I have inserted a whole bunch of runtime sanity tests (for example, when iterating over a category chain looking for a state, I check that the category of each state matches the chain it is in). The minor, pretty much undetectable performance hit is more than offset by detecting problems much earlier than otherwise. I recommend this to everyone handling sensitive data structures.

Tuesday, September 25, 2012

      GameCracker - 8. Category Heads

When we consider the execution time of the program, the step that can be expected to take the most time is checking whether a state is already in the graph. That's why I introduced the concept of categories. As a reminder, a category is a hash that limits which states have a possibility of being equal to the new one. For each category, I have a linked list of the nodes containing states in that category; these category chains contain the nodes I need to check during expansion. I also described that each node contains a link to the next node in the same category chain, so these links are neatly contained in the main data structure. However, I need to know which node is the first in a particular chain. So for the category chains to work, I need an additional data structure that, given the category, tells me the ID of the first node in that category chain (the category head). Since both the category itself and the node ID are long values, this data structure is essentially a long-long map.
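To make the shape of this map concrete, here is a minimal in-memory sketch of a category-head lookup. The class and method names are mine, not the actual GameCracker API, and the real store is file-based rather than a java.util.HashMap:

```java
import java.util.HashMap;
import java.util.Map;

public class CategoryHeads {
    // In-memory stand-in for the file-based store: category -> first node ID.
    private final Map<Long, Long> heads = new HashMap<>();

    /** Returns the ID of the first node in the category's chain,
     *  or -1 if the category has no chain yet. */
    public long getHead(long category) {
        return heads.getOrDefault(category, -1L);
    }

    /** Records the current chain head for the category. */
    public void putHead(long category, long nodeId) {
        heads.put(category, nodeId);
    }
}
```

During expansion, getHead gives the entry point into a chain, and the rest of the chain is walked via the next-in-chain links stored in the nodes themselves.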

      If the category does a nice job limiting the number of nodes that need inspecting, that means that there are a lot of categories. The time required to look up a category head can become an issue as the number of category chains grows.

      With that in mind, I chose a file-based hash table to store the category heads. If I remember correctly, I took a hash map implementation (not the chained java.util.HashMap, but one using open addressing) and adapted it a bit. Here's how it works.

      Open addressed hash maps store their elements in arrays; by a hashing function, each key is assigned a position in the array. If the position is already occupied, a free position is found via some probing method. As new elements are added to the map, the array fills up, and collisions become more common; there are more and more elements that require more and more steps to find. This ruins the performance of the data structure. To remedy this, the entire hash map is recreated using a larger array.

This recreation is something that I couldn't fit into my file-based transactional data storage system. Even if I somehow managed to make it work, it would probably take a lot of time in the file system. So I settled for a compromise. Instead of reallocating the entire array, I append a new array at the end of the file. My category head store thus consists of a sequence of these arrays, which I call blocks. The last block is the one new elements are added to, until it fills up and a new block is appended. The price of this strategy is that when searching for a key, every single block must be examined to see if it contains the key. So besides the probing overhead caused by collisions, there is this additional per-block cost.
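To illustrate the strategy, here is a rough in-memory sketch of such a block-based, open-addressed long-long map. All names and details are illustrative; the real store works on a memory-mapped file rather than Java arrays, and 0 plays the role of the invalid value marking empty cells, as in the actual store:

```java
import java.util.ArrayList;
import java.util.List;

public class BlockedHashStore {
    private static final long EMPTY = 0; // 0 marks an empty cell; keys must be non-zero

    private final int blockSize;
    private final List<long[]> keyBlocks = new ArrayList<>();
    private final List<long[]> valueBlocks = new ArrayList<>();
    private int lastBlockCount; // number of elements in the last block

    public BlockedHashStore(int blockSize) {
        this.blockSize = blockSize;
        addBlock();
    }

    private void addBlock() {
        keyBlocks.add(new long[blockSize]);
        valueBlocks.add(new long[blockSize]);
        lastBlockCount = 0;
    }

    /** Linear probing within one block; returns the index of the key,
     *  or of the empty cell where the probe stopped. */
    private int probe(long[] keys, long key) {
        int i = (int) (Long.hashCode(key) & 0x7fffffff) % blockSize;
        while (keys[i] != EMPTY && keys[i] != key)
            i = (i + 1) % blockSize;
        return i;
    }

    /** Every block must be searched, since the key may live in any of them. */
    public long get(long key) {
        for (int b = 0; b < keyBlocks.size(); b++) {
            int i = probe(keyBlocks.get(b), key);
            if (keyBlocks.get(b)[i] == key)
                return valueBlocks.get(b)[i];
        }
        return EMPTY; // not found
    }

    /** New elements go into the last block; when it reaches half capacity
     *  (an arbitrary threshold here), a fresh block is appended. */
    public void put(long key, long value) {
        for (int b = 0; b < keyBlocks.size(); b++) { // update in place if present
            int i = probe(keyBlocks.get(b), key);
            if (keyBlocks.get(b)[i] == key) {
                valueBlocks.get(b)[i] = value;
                return;
            }
        }
        if (lastBlockCount >= blockSize / 2)
            addBlock();
        long[] keys = keyBlocks.get(keyBlocks.size() - 1);
        int i = probe(keys, key);
        keys[i] = key;
        valueBlocks.get(valueBlocks.size() - 1)[i] = value;
        lastBlockCount++;
    }
}
```

Keeping each block at most half full guarantees that every probe terminates at an empty cell, but note how both get and put scale linearly with the number of blocks, which is exactly the overhead described above.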

      This is the data structure that I use now. I'd really love to find a better one though, one with better performance and more pleasant growing abilities. Recently I've been wondering how databases would do this. If I created a table with two 'long' columns, the category (which would be the primary key) and the head ID, how would it perform compared to my hash map? Does it have some tricks that I could learn? One of these days I'll take a closer look.

Sunday, March 25, 2012

      Generating a Random Combination

      Recently I've had my mind on lottery drawing, and I suddenly thought of a way to generate a random combination that is better than how I used to do it.

      Here's the basic problem. Given the set of numbers (0, 1, ... n-1), we want to generate a random subset of k elements of this set of n. The random generator function that I can use is Random.nextInt(n), which returns a number between 0 and n-1. Now, if we have already generated some of the numbers, (c[0], c[1], ..., c[i-1]), then how do we generate the next one?

      The way I traditionally do it is that I keep generating a random number between 0 and n-1 until I get one which is different from the ones I already have:

      List<Integer> numbers=new ArrayList<Integer>();
      while (numbers.size()<k) {
          int next=rnd.nextInt(n);
          if (!numbers.contains(next))
              numbers.add(next);
      }
I've kept wondering if there is a way to do it that doesn't involve an unbounded loop, and there is. Think of how lottery numbers are drawn. Once a number is selected, it is discarded from the set of possible numbers, and the next one is then chosen uniformly from the rest. We can do something quite similar in code!

      Once again, suppose that we have already generated i numbers, and we're looking for the (i+1)-th one. There are (n-i) candidates left to choose from, the only problem is that they're not the numbers from 0 to n-i-1 (which is the interval in which we can generate one of (n-i) numbers). But we can map these two sets! Let me demonstrate on an example. Consider the numbers between 0 and 9, and suppose we have already chosen 3, 4, and 7. Here's our problem:

      The numbers we need to choose from: 0 1 2     5 6   8 9
      The numbers we can generate from: 0 1 2 3 4 5 6

      Notice that we can collapse the first row, giving us a mapping from the 7 numbers between 0 and 6 to the 7 numbers between 0 and 9 but excluding the ones we've already chosen. The question is: if we generate one in the former set, how can we get the actual number it corresponds to? Here's how:

List<Integer> numbers=new ArrayList<Integer>();
for (int i=0; i<k; i++) {
    int next=rnd.nextInt(n-i);
    int nextIndex;
    for (nextIndex=0; nextIndex<i && numbers.get(nextIndex)<=next; nextIndex++)
        next++;
    numbers.add(nextIndex, next);
}
Note that now we maintain the list of selected numbers in ascending order. What we basically do is shift the generated number to the right (increment it) as many times as needed to account for the 'empty slots'. For example, suppose that in the scenario depicted above, we get the random number 3. Since the first number already selected (which is 3) is <= the new number, we increment it to get 4. The next selected number (4) is again <= our number, so we increment once more and get 5. The next number in the list is 7, so we stop incrementing. We've found the number among the remaining ones which corresponds to the generated number 3 (or, to put it another way, the fourth one among the remaining candidates): 5. We similarly need two increments if we start from the number 4. If the generator gives us 5, then after two increments we find that the next already chosen number, 7, causes a third increment, giving us 8. And so on.

      As a nice side-effect, we can notice that the number of times we needed to increment gives us the index where we have to insert the new number to keep the list sorted (since it is equal to how many numbers we already have which are less than this new one).
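The mapping step can be isolated into a small helper to make the walkthrough concrete (the class and method names here are mine):

```java
import java.util.Arrays;
import java.util.List;

public class CombinationMapping {
    /** Maps a number generated in [0, n-i) to the corresponding not-yet-chosen
     *  number, by skipping over the already chosen ones (which must be sorted
     *  in ascending order). */
    public static int mapToRemaining(int next, List<Integer> chosen) {
        for (int i = 0; i < chosen.size() && chosen.get(i) <= next; i++)
            next++;
        return next;
    }

    public static void main(String[] args) {
        // The scenario from the walkthrough: 3, 4, and 7 already chosen.
        List<Integer> chosen = Arrays.asList(3, 4, 7);
        System.out.println(mapToRemaining(3, chosen)); // 3 -> 4 -> 5, prints 5
        System.out.println(mapToRemaining(4, chosen)); // 4 -> 5 -> 6, prints 6
        System.out.println(mapToRemaining(5, chosen)); // 5 -> 6 -> 7 -> 8, prints 8
    }
}
```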

      I like this algorithm much more than the original one. We can give an upper limit to its execution time, so we don't run into problems even when k is close to n. I'm pretty sure that I'm not the first one to think of this way of generating combinations, but I wonder how come I've never seen this method before.

P.S. There's another, quite pretty approach on Wikipedia. That one is similarly easy to implement, has bounded run-time, and I hadn't heard of it before either.

Monday, June 20, 2011

      Synchronization Strangeness

      I've encountered quite a head-scratcher; it took me about an hour to finally realize what was going on.

      I'm using a background thread which communicates with the main thread using a SynchronousQueue. For those of you who don't know this class, it can be used to establish rendez-vous points between two threads. One thread calls queue.put(obj), the other one calls queue.take(), and both of these calls block waiting for one another. When the two threads meet up at these calls, an object is transferred from one to the other, and then both functions return.
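For illustration, here is a minimal, working rendezvous between two threads (this is my own toy example, not part of the original program):

```java
import java.util.concurrent.SynchronousQueue;

// A SynchronousQueue has no capacity: put() blocks until another thread
// calls take(), and vice versa, so the two calls form a rendezvous point.
public class Rendezvous {
    public static void main(String[] args) throws InterruptedException {
        SynchronousQueue<String> queue = new SynchronousQueue<>();

        Thread receiver = new Thread(() -> {
            try {
                // Blocks here until the main thread calls put().
                System.out.println("received: " + queue.take());
            } catch (InterruptedException ignored) {
            }
        });
        receiver.start();

        queue.put("hello"); // blocks until the receiver's take()
        receiver.join();    // prints "received: hello"
    }
}
```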

      Now, I had a class in which I create a thread in the static initializer; this thread will be the receiver. Also, it just happens that the first queue.put call took place before the static initializer finished. Basically, I had something like this:

      public class Sync {
          private static final SynchronousQueue queue=new SynchronousQueue();
          private static final Thread thread;
          static {
              thread=new Thread("Background") {
                  {
                      setDaemon(true);
                  }

                  @Override
                  public void run() {
                      try {
                          queue.take();
                      } catch (InterruptedException e) {
                      }
                  }
                 
              };
              thread.start();

              try {
                  queue.put(5);
              } catch (InterruptedException e) {}
          }
          public static void main(String[] args) {
          }
      }


      What happened when I ran the program is that both the take and the put functions blocked, the SynchronousQueue just didn't seem to work. I jumped into JConsole and it showed that the main thread was waiting inside the put function, and the background thread was at the take function, but its state was RUNNABLE, and the stack trace consisted of a single frame: "Sync$1.run(Sync.java:13)" (the stack trace of the main thread went down quite some way into the put function, ending up at "sun.misc.Unsafe.park(Native Method)").

      So if the background thread was RUNNABLE, why was it just standing there? And where exactly was it standing? According to the stack trace, it didn't enter the take function at all.

      Finally I realized. It wasn't the take method that was blocking the background thread, it was the access to the queue object. The queue object is a static field of the class, and the class itself needs to be initialized for other threads to access it. (Other than the thread which is performing the initialization itself, that is.) So the access to the field was blocked while the JVM was waiting for the initialization of the class to finish. But it wouldn't finish, since the initialization was blocked on the put function. It was a deadlock involving class initialization.
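One way out of this particular trap, sketched below, is to keep blocking calls out of the static initializer entirely and perform the first put only after class initialization has completed. This is an illustration of the principle, not necessarily how the original code was repaired:

```java
import java.util.concurrent.SynchronousQueue;

public class SyncFixed {
    private static final SynchronousQueue<Integer> queue = new SynchronousQueue<>();
    private static final Thread thread;

    static {
        thread = new Thread("Background") {
            @Override
            public void run() {
                try {
                    // This field access may briefly block until class
                    // initialization finishes, but since the initializer no
                    // longer blocks on put(), it finishes promptly.
                    System.out.println("took: " + queue.take());
                } catch (InterruptedException ignored) {
                }
            }
        };
        thread.setDaemon(true);
        thread.start();
        // No queue.put() here: blocking inside the static initializer is
        // exactly what caused the deadlock.
    }

    public static void main(String[] args) throws InterruptedException {
        queue.put(5); // runs after initialization, so the rendezvous succeeds
        thread.join();
    }
}
```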

Class initialization is tricky business, mainly because in general it 'just works', and thus programmers aren't completely aware of its intricacies. When one tries to be clever, it is not that difficult to stumble into weird bugs.