When dealing with anything but the simplest games (Tic-Tac-Toe or 4x4 reversi), the program takes up a lot of computing power, disk space, and time. How can I be sure that it even does the right thing, that it does what I intended it to do?
I can't, of course. All programs, even ones much simpler than GameCracker, are bound to have bugs. And since I have no way of making sure that GameCracker has none, and I want to set it to work anyway, I just have to suspend my doubts and assume that everything is all right. At least until I have reason to believe otherwise. Because as soon as I have the slightest suspicion that something is wrong, I lose all confidence in the graph I have built so far, even if I'm not sure it's actually incorrect. (There's no practical way to thoroughly check the data for consistency problems; it would take forever. Maybe not as long as building it took, but still...)
For example, in my chess module, I have some test cases that commemorate bugs otherwise long forgotten. One is related to the rule I use for avoiding infinite matches: if a state repeats, the game is a draw. This is a little stricter than the official rules, so real games could contain repeats, but my graph still answers the 'who will win' question just fine if we ignore the moves between the repeating states; those don't matter in any way.

So in my model, a chess state consists of not only the board, and not only the extra information about the availability of certain special moves (like castling and en passant) that cannot be deduced from the board itself, but also all those past boards that have a chance of repeating. I can clear this 'past states' set whenever something irreversible happens, like a pawn moving, a piece being captured, or one of the players losing the ability to castle or capture en passant; otherwise these past states matter, and even if two states are the same with regard to the board, the player to move, and all those special moves, a different 'past states' set means I cannot consider them equivalent. Quite unfortunate, but what can you do. You can list a series of moves that leads from one of them to a repeating state, causing a draw, while the same moves applied in the other lead to a new board setup and the game goes on. The difference seems slight from an ordinary chess player's point of view, but it is actually huge: that sequence of moves leads to a draw from one state and to a possibly different result from the other, so the results of these two 'similar' states can end up entirely different too. I think I'm overexplaining things, so here's a quick sketch of what such a state looks like, and then on to my example.
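This is not my actual code, just a simplified illustration: the names, the plain-string board, and the field layout are made up for the sketch. The point is only that two states compare equal only if everything matches, the set of past positions included.

import java.util.Objects;
import java.util.Set;

// Hypothetical sketch of a chess state key; not the real GameCracker classes.
final class ChessState {
    final String board;                  // piece placement, an opaque string here
    final boolean whiteToMove;           // player to move
    final int castlingRights;            // bitmask of the four castling options
    final int enPassantFile;             // -1 when no en passant capture is available
    final Set<PastPosition> pastStates;  // positions seen since the last irreversible move

    ChessState(String board, boolean whiteToMove, int castlingRights,
               int enPassantFile, Set<PastPosition> pastStates) {
        this.board = board;
        this.whiteToMove = whiteToMove;
        this.castlingRights = castlingRights;
        this.enPassantFile = enPassantFile;
        this.pastStates = pastStates;
    }

    // Two states are equivalent only if everything matches, the 'past states' set included.
    @Override public boolean equals(Object o) {
        if (!(o instanceof ChessState)) return false;
        ChessState s = (ChessState) o;
        return whiteToMove == s.whiteToMove
                && castlingRights == s.castlingRights
                && enPassantFile == s.enPassantFile
                && board.equals(s.board)
                && pastStates.equals(s.pastStates);
    }

    @Override public int hashCode() {
        return Objects.hash(board, whiteToMove, castlingRights, enPassantFile, pastStates);
    }
}

// One entry of the 'past states' set. The board alone is not enough:
// the player to move has to be part of it too (see the bug below).
final class PastPosition {
    final String board;
    final boolean whiteToMove;

    PastPosition(String board, boolean whiteToMove) {
        this.board = board;
        this.whiteToMove = whiteToMove;
    }

    @Override public boolean equals(Object o) {
        if (!(o instanceof PastPosition)) return false;
        PastPosition p = (PastPosition) o;
        return whiteToMove == p.whiteToMove && board.equals(p.board);
    }

    @Override public int hashCode() {
        return Objects.hash(board, whiteToMove);
    }
}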
Long ago I had a bug where I only stored the boards themselves as the 'past states', and that caused my model to call draws when it shouldn't have. It turned out that it very much mattered who the player to move was in those past states. If we find ourselves at a board that we saw before, but now it is the other player's move, then it is not a repeat, and it is not a cycle in the graph. So I had to store the 'player to move' information along with the board in the 'past states'. Needless to say, this invalidated the entire graph; I had to throw it out. I see another test related to the interaction between repeating states and en passant, and a third one making sure that if a pawn moves only one square forward, it cannot be captured en passant. Bugs like these happen all the time, but once you've found many of them, and time passes, and the program seems to work right, you can grow to trust that there are no more problems, and that the graph and everything in it is valid and correct. Until you notice something that's not.
A couple of days ago I wrote a little routine to count how many nodes were in each category. I had started to notice that some category chains were getting long, and it took several seconds to load and go through them. If there were a few chains significantly longer than the rest, then maybe a better category algorithm could be introduced to even out the lengths. So that was the motivation. To gather and count the category lengths, I first had to gather all the categories. With the category head store described in the previous entry, the only way to do this was by going over all the hash tables, all the arrays from beginning to end, and fetching the category number from the non-empty cells. This is quite time-consuming, actually, the category head store file being about 11 GB in size, but memory mapping helps a lot with such things. Now then, while I was doing this, I counted how many category heads I found in the store; this is the number of categories that I have. Of course this number can also be computed from the store's metadata (FullBlockCount * MaxBlockSize + LastBlockSize), so I compared the two numbers. And they were different.
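For the record, the counting itself is nothing fancy; it's roughly the loop below. This is only a sketch: the file name, the 16-byte cell layout with the head ID in the second half of each cell, and the way the block sizes are obtained are simplifications, not my real format.

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Sketch of the consistency check. Assumed layout: the file is a sequence of blocks,
// each an array of 16-byte cells (category, head node ID), with head ID 0 marking an
// empty cell. The real store's layout and names differ.
public class CategoryHeadCount {

    public static void main(String[] args) throws Exception {
        long[] blockCellCounts = {};   // cell count of each block, read from the store's metadata
        long expected = 0L;            // FullBlockCount * MaxBlockSize + LastBlockSize, from the metadata

        long counted = 0;
        long offset = 0;
        RandomAccessFile file = new RandomAccessFile("categoryHeads.dat", "r");
        try {
            FileChannel channel = file.getChannel();
            for (long cells : blockCellCounts) {
                // Map one block at a time; mapping the whole 11 GB file at once would
                // not fit into a single MappedByteBuffer anyway.
                MappedByteBuffer block = channel.map(FileChannel.MapMode.READ_ONLY, offset, cells * 16);
                for (int i = 0; i < cells; i++) {
                    long headId = block.getLong(i * 16 + 8);
                    if (headId != 0) {   // 0 is the 'empty cell' marker in this sketch
                        counted++;
                    }
                }
                offset += cells * 16;
            }
        } finally {
            file.close();
        }

        System.out.println("counted=" + counted + ", expected=" + expected
                + (counted == expected ? " (consistent)" : " (INCONSISTENT!)"));
    }
}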
I have found an inconsistency in the data. This basically means that all the previous work is ruined. The category head store is incorrect, and who knows what else is wrong. Unless I find the cause of the problem and it turns out that the graph itself is not affected (very unlikely), I have to throw it all away. And this is a bit of a tragedy. Here's why.
In this run of building the chess graph, I have accumulated about 280 million nodes; the entire data structure is now 110 GB, and the total amount of time (real time) my computer has spent on it, though I didn't measure it and don't know the exact number, is definitely over 250 days. Welp!
I looked at the code of the category head store long and carefully, to see where I increment the metadata and where I put in new elements (there are no removals, so it is not that complex). This inspection gave me no clues. I then checked the category store for duplicates. (Since it is a map, it shouldn't have any.) I saw that the category -17 occurred twice. Both occurrences had the value 0 associated with them, which is the invalid value that I use for empty cells. And of course I make sure that the invalid value is never accepted by the store. Thinking that maybe I was indexing something incorrectly in the file, I looked at the cells surrounding both of these invalid cells. The cells immediately after them were empty, and the ones before them had values in them. I loaded up those states and checked whether their categories were the ones that were stored; they were. Some further luck led me to notice, however, that if I take one of these states, navigate up to its parent node, and apply the corresponding move in that parent state, I get a state that is different from the one I started with. The board is right, but one has two 'past states' and the other has one. The graph is definitely wrong.
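That parent check is worth spelling out, because it is a very cheap way to catch this kind of corruption. Roughly, with made-up types rather than my actual classes:

// Hypothetical types, made up for illustration; not the actual GameCracker API.
interface Move {}
interface State { State applyMove(Move move); }
interface Node {
    Node getParent();          // the node whose expansion produced this one
    Move getMoveFromParent();  // the move that led from the parent to this node
    State getState();
}

final class GraphSpotCheck {
    // Re-derive a node's state from its parent and the connecting move, and compare it
    // with what is actually stored. In my case the boards matched but the 'past states'
    // sets did not, which is exactly what a check like this would flag.
    static boolean matchesParentDerivation(Node node) {
        State rederived = node.getParent().getState().applyMove(node.getMoveFromParent());
        return rederived.equals(node.getState());
    }
}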
I will have to investigate further and try to find where the bug is. But I also think that I will rewrite the program, structuring it better so that, besides my own approach, I can try various actual database solutions for storing my graph, and compare which is better. I will publish the source code this time, on GitHub, which I'm pretty sure will lead to better design and documentation. Of course, I make absolutely no promises whatsoever.
Update: I have found the bug with the chess states. I had changed how the past states are stored, and forgot to update half of the 'addPastState' function... The change was so simple that I didn't even consider that I could have made a mistake. Well, let this be a lesson: making mistakes is very easy, and it is worth having more thorough tests.
The issue with the category head store, though, is still a mystery. I've looked at the code for so long that it now seems simple. I have to assume that I either manipulated the file outside the normal program flow and made a mistake there, or that my hard drive is malfunctioning. Either way, the program is running once again with recreated data, and I have inserted a whole bunch of runtime sanity checks (for example, when iterating over a category chain looking for a state, I check the category of each state to make sure it matches the chain it is in). The minor, pretty much undetectable performance hit is more than offset by detecting problems much earlier than I otherwise would. I recommend this to everyone handling sensitive data structures.
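The chain-walk check amounts to roughly the following; the names are made up for illustration, and 0 stands for 'no node' here:

// Made-up interface for the sake of the example; not my real node storage API.
interface NodeStorage {
    long nextInCategoryChain(long nodeId); // 0 marks the end of the chain in this sketch
    long categoryOf(long nodeId);
    Object stateOf(long nodeId);
}

final class ChainSearch {
    // Walk a category chain looking for a state, and on the way verify that every node
    // really belongs to the chain it is linked into. Returns the node ID, or 0 if absent.
    static long findInChain(long headNodeId, long category, Object wantedState, NodeStorage nodes) {
        for (long id = headNodeId; id != 0; id = nodes.nextInCategoryChain(id)) {
            if (nodes.categoryOf(id) != category) {
                throw new IllegalStateException("Node " + id + " sits in chain " + category
                        + " but its own category is " + nodes.categoryOf(id));
            }
            if (nodes.stateOf(id).equals(wantedState)) {
                return id;
            }
        }
        return 0;
    }
}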
Friday, September 28, 2012
Tuesday, September 25, 2012
GameCracker - 8. Category Heads
When we consider the execution time of the program, the step that can be expected to take the most time is checking whether a state is already in the graph. That's why I introduced the concept of categories. As a reminder, a category is a hash value that limits which stored states could possibly be equal to the new one. For each category, I have a linked list of the nodes whose states are in that category; these category chains are the nodes I need to check during expansion. I also described that each node contains a link to the next node in the same category chain, so these links are neatly contained in the main data structure. However, I need to know which node is the first in a particular chain. So for the category chains to work, I need an additional data structure that, given the category, tells me the ID of the first node in its chain (the category head). Since both the category and the node ID are long values, this data structure is essentially a long-to-long map.
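Reduced to its essence, the store I need looks like this; it's just an illustrative interface, not my actual code, and 0 stands for 'this category has no chain yet':

// The category head store, reduced to its essence: a long-to-long map.
interface CategoryHeadStore {
    /** The node ID of the first node in the category's chain, or 0 if the category is new. */
    long getHead(long category);

    /** Remember the head node of a category's chain. */
    void putHead(long category, long headNodeId);
}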
If the categories do a good job of limiting the number of nodes that need inspecting, that necessarily means there are a lot of categories. So the time required to look up a category head can itself become an issue as the number of category chains grows.
With that in mind, I chose a file-based hash table to store the category heads. If I remember correctly, I took a hash map implementation (not the chained java.util.HashMap, but one using open addressing) and adapted it a bit. Here's how it works.
Open addressed hash maps store their elements in arrays: a hashing function assigns each key a position in the array, and if that position is already occupied, a free one is found via some probing method. As new elements are added to the map, the array fills up and collisions become more common; more and more elements take more and more steps to find, which ruins the performance of the data structure. The usual remedy is to recreate the entire hash map with a larger array.
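Stripped down to the bare bones, such a map for long keys and long values looks something like the following sketch: fixed capacity, linear probing, no resizing and no deletion, with 0 reserved as the 'empty' marker. It's an illustration of the technique, not my actual implementation.

// Open addressing with linear probing for a long-to-long map, reduced to the bare minimum.
// This sketch assumes the table never fills up completely.
final class ProbingLongLongMap {
    private final long[] keys;
    private final long[] values;   // 0 means 'empty cell'
    private final int capacity;

    ProbingLongLongMap(int capacity) {
        this.capacity = capacity;
        this.keys = new long[capacity];
        this.values = new long[capacity];
    }

    // Hash the key to a starting position, then probe forward until we hit
    // either the key itself or an empty cell.
    private int indexOf(long key) {
        int i = (int) ((key ^ (key >>> 32)) & 0x7fffffff) % capacity;
        while (values[i] != 0 && keys[i] != key) {
            i = (i + 1) % capacity;
        }
        return i;
    }

    void put(long key, long value) {
        if (value == 0) throw new IllegalArgumentException("0 is reserved as the empty marker");
        int i = indexOf(key);
        keys[i] = key;
        values[i] = value;
    }

    long get(long key) {
        int i = indexOf(key);
        return values[i] != 0 && keys[i] == key ? values[i] : 0; // 0 means 'not found'
    }
}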
This recreation is something that I couldn't fit into my file-based transactional data storage system. Even if I could somehow make it work, it would probably take a lot of time to do in the file system. So I settled on a compromise. Instead of reallocating the entire array, I append a new array at the end of the file. My category head store thus consists of a sequence of these arrays, which I call blocks. The last block is the one into which new elements are added, until it fills up and a new block is appended. The price of this strategy is that when searching for a key, every single block must be examined to see whether it contains the key. So that's the additional price: an overhead on top of the probing caused by collisions.
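Using the probing map sketch above as a single block, the compromise looks roughly like this; the blocks are equal-sized and in-memory here just to keep the sketch short, while the real store lives in a file and its parameters differ:

import java.util.ArrayList;
import java.util.List;

// Sketch of the block-append compromise, with the probing map above standing in for one
// block. New entries go into the last block only; lookups must consult every block.
final class BlockedCategoryHeadStore {
    private final int arrayCapacity;       // cells per block
    private final int maxEntriesPerBlock;  // how full a block may get before a new one is appended
    private final List<ProbingLongLongMap> blocks = new ArrayList<ProbingLongLongMap>();
    private int entriesInLastBlock = 0;

    BlockedCategoryHeadStore(int arrayCapacity, int maxEntriesPerBlock) {
        this.arrayCapacity = arrayCapacity;
        this.maxEntriesPerBlock = maxEntriesPerBlock;
        blocks.add(new ProbingLongLongMap(arrayCapacity));
    }

    // The extra price of never reallocating: every block must be examined.
    long getHead(long category) {
        for (int i = blocks.size() - 1; i >= 0; i--) {
            long head = blocks.get(i).get(category);
            if (head != 0) {
                return head;
            }
        }
        return 0; // unknown category
    }

    // New categories only, to keep the sketch simple.
    void putHead(long category, long headNodeId) {
        if (entriesInLastBlock >= maxEntriesPerBlock) {
            blocks.add(new ProbingLongLongMap(arrayCapacity));
            entriesInLastBlock = 0;
        }
        blocks.get(blocks.size() - 1).put(category, headNodeId);
        entriesInLastBlock++;
    }

    // The same number the store's metadata gives directly:
    // FullBlockCount * MaxBlockSize + LastBlockSize.
    long size() {
        return (long) (blocks.size() - 1) * maxEntriesPerBlock + entriesInLastBlock;
    }
}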
This is the data structure I use now. I'd really love to find a better one, though: one with better performance that grows more gracefully. Recently I've been wondering how databases would handle this. If I created a table with two 'long' columns, the category (which would be the primary key) and the head ID, how would it perform compared to my hash map? Does it have some tricks that I could learn? One of these days I'll take a closer look.