Chess Piece Face

A couple of months after DeepMind announced their achievement with the AlphaZero chess-playing algorithm, I wrote an entry giving my own thinking on the matter. To a large extent, I wrote about my own adjustment from the previous generation of neural network technology to the then-cutting-edge techniques which were used to create (arguably) the world's best chess player. Some of the concepts were new to me, and I was trying to understand them even as I explained them in my own words. Other things were left vague in the summary paper, leaving me to fill in the blanks with guesses about how the programming team might have gone about their work.

It took about another year, but the team eventually released a fully-detailed paper explaining the AlphaZero architecture, even as tech-savvy amateurs (albeit folks far more competent than I) dissected the available information. Notably, this included the creation of open-source reimplementations of AlphaZero's proprietary technology. Now, more than three years on, there are many basic-level explanations, using diagrams and examples, explaining to non-experts what AlphaZero has done and how (for just one example, with diagram, see here).

I made a couple of serious mistakes in my previous analysis. Most importantly, I misunderstood the significance of AlphaZero's “general-purpose Monte-Carlo tree search (MCTS) algorithm.” When I first read that, I interpreted it as simply the ability to randomly generate valid moves and thus to compile the list of possibilities which the neural network was to rank. An untrained neural net would rank its choices randomly and probably lose*, but could begin to learn how to win as better and better training sets were developed. That isn't quite accurate. The better way to think about the AlphaZero solution is that it is fundamentally an MCTS algorithm, but one which uses a neural net as a shortcut for estimating parameters where they have yet to be calculated. I think I'll want to come back to this later, but in the meantime an article using tic-tac-toe to explain MCTS might be educational.
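To make that distinction concrete, here is a heavily simplified sketch (mine, not DeepMind's code) of an MCTS loop that leans on a network for its move priors and value estimates instead of random playouts. The game and network stubs (`legal_moves`, `apply_move`, `net_evaluate`) are hypothetical placeholders that exist only to make the sketch self-contained:

```python
import math

# Hypothetical stand-ins: a real implementation would wrap an actual game
# engine and a trained network. These stubs exist only to make the sketch run.
def legal_moves(state):
    return [0, 1, 2]                       # pretend every position has 3 moves

def apply_move(state, move):
    return state + (move,)                 # a "position" is just the move list

def net_evaluate(state):
    # A trained policy/value network would return (move priors, value estimate).
    # An untrained one returns something arbitrary, like this uniform guess.
    moves = legal_moves(state)
    return {m: 1.0 / len(moves) for m in moves}, 0.0

class Node:
    def __init__(self, prior):
        self.prior = prior                 # P(s,a): the network's prior
        self.visits = 0                    # N(s,a): simulation visit count
        self.value_sum = 0.0               # W(s,a): accumulated value
        self.children = {}                 # move -> Node

    def q(self):                           # Q(s,a): mean value of the subtree
        return self.value_sum / self.visits if self.visits else 0.0

def puct(parent, child, c_puct=1.5):
    # The prior steers exploration toward moves the network likes but the
    # search hasn't visited much yet.
    u = c_puct * child.prior * math.sqrt(parent.visits) / (1 + child.visits)
    return child.q() + u

def search(root_state, root, n_simulations=100):
    # (Terminal-position handling is omitted for brevity.)
    for _ in range(n_simulations):
        node, state, path = root, root_state, [root]
        # 1. Selection: descend the tree, always taking the best PUCT score.
        while node.children:
            move, node = max(node.children.items(),
                             key=lambda kv: puct(path[-1], kv[1]))
            state = apply_move(state, move)
            path.append(node)
        # 2. Expansion/evaluation: the network stands in for random playouts.
        priors, value = net_evaluate(state)
        node.children = {m: Node(p) for m, p in priors.items()}
        # 3. Backup: push the value estimate back up the visited path,
        #    flipping sign because the players alternate.
        for n in reversed(path):
            n.visits += 1
            n.value_sum += value
            value = -value

root = Node(prior=1.0)
search(root_state=(), root=root)
print({move: child.visits for move, child in root.children.items()})
```

The key point is step 2: where classic MCTS would play out random games to estimate a position's worth, the network supplies that estimate directly, and the tree search in turn sharpens the network's raw guesses.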

My next error was in my understanding of convolutional networks. If you read my explanation, you might surmise that the source of my error was my outdated understanding of neural network technology in general. I imagined the convolutional layer as being a set of small neural networks. Further reading (a well-illustrated example is here) about the use of convolution in image processing instructed me on the use of filters. This convolution strategy uses transformations that have been well defined through years of use in image manipulation and, in fact, often produce results familiar to those of us who play with photographs through the likes of Photoshop or GIMP. That a filter which helps sharpen images for the human eye would also help a machine intelligence with edge detection makes sense. In the broader sense, it is a simple method of encapsulating concepts like proximity in a way that does not require a neural net training process to learn about them on its own.
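As a concrete illustration, here is a small example (using numpy and scipy, nothing AlphaZero-specific) applying two classic hand-built kernels. A convolutional layer performs exactly this operation; the difference is that the network learns its kernel weights during training rather than having them hand-chosen:

```python
import numpy as np
from scipy.signal import convolve2d

# A toy grayscale "image": dark on the left, bright on the right,
# with a sharp vertical edge down the middle.
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# Two classic hand-designed 3x3 kernels, both far older than deep learning.
sharpen = np.array([[ 0, -1,  0],
                    [-1,  5, -1],
                    [ 0, -1,  0]])      # the familiar Photoshop-style sharpen
sobel_x = np.array([[-1,  0,  1],
                    [-2,  0,  2],
                    [-1,  0,  1]])      # responds to horizontal intensity change

sharpened = convolve2d(image, sharpen, mode="same", boundary="symm")
edges = convolve2d(image, sobel_x, mode="same", boundary="symm")

# The Sobel output is nonzero only along the boundary: the edge was "detected."
print(edges)
```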

I continue to believe that there is quite a bit of method, born of extensive trial-and-error, in arranging the game state so that it makes sense, not only to the neural network, but also to those image processing filters. When the problem is identifying elements of an image, those real-life objects occupy distinct regions within the grid of pixels that make up that image. For the “concepts” about a chess board during a match, what is it that causes information to be co-located within a game-turn's representation? It seems to me that the structure of the input arrays would have to be conceived by thinking of it as turning chess information into something that resembles a photograph. While the paper clearly states that the researchers DIDN'T find the results to depend upon the data structure, it would seem this is a lot easier to get wrong than to get right.
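To show what I mean by resembling a photograph, here is one simplified way (my sketch, not the paper's exact scheme) to encode a position as a stack of 8x8 binary planes, one per piece type and color, so that proximity on the board becomes proximity in the array for the filters to exploit. I'm using the python-chess library purely for board bookkeeping; the real input described in the paper also stacks several moves of history plus auxiliary planes for things like castling rights and side to move:

```python
import numpy as np
import chess  # the python-chess library, used here just for board state

def board_to_planes(board: chess.Board) -> np.ndarray:
    """Encode a position as 12 binary 8x8 planes: one plane per
    (color, piece type) combination. A stripped-down take on the
    AlphaZero input representation."""
    planes = np.zeros((12, 8, 8), dtype=np.float32)
    for square, piece in board.piece_map().items():
        # piece_type runs 1..6 (pawn..king); color is True for white.
        plane = (piece.piece_type - 1) + (0 if piece.color else 6)
        row, col = chess.square_rank(square), chess.square_file(square)
        planes[plane, row, col] = 1.0
    return planes

planes = board_to_planes(chess.Board())  # the starting position
print(planes.shape)          # (12, 8, 8)
print(planes[0].sum())       # 8.0 -- the eight white pawns share one plane
```

Each plane ends up looking like a tiny black-and-white image of where one kind of piece sits, which is exactly the sort of input those image-processing filters were built for.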

I also note, assuming the illustration of AlphaZero's architecture in the above paper is correct, that the neural network is extraordinarily complex. At the same time, it seems arbitrarily selected in advance. The dimensions of the architecture are nice round numbers – 256 convolution filters, 40 residual layers, etc. – which suggests to me one (or both) of two things. First, these are probably borrowed architectural decisions drawn from long experience with image-processing neural networks. Second, it may not have been necessary to “optimize” the architecture itself if the results were insensitive to the details. Of course, it's also possible that when you're drowning in excess computing power, you'd be more than happy to vastly oversize your neural network and then rely on the training process to trim the fat rather than invest the manpower to improve the architecture manually. Further, humans may make the right architectural decisions or the wrong ones. The network training will, at all times, attempt to make the objectively-best decision AND, perhaps more importantly, continually re-evaluate those decisions with each new run. For example, if you have enough computing power and training data, the fact that 252 of those 256 filters are excluded every time may not be particularly concerning. Especially if, at some point, a new generation of the network decides that, after all, one of those excluded 252 really is important.
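For what it's worth, those round numbers translate into surprisingly little code. Here is a rough sketch in PyTorch of one residual block and a tower built from them; this follows the general published description, but the details (and the policy and value heads, which I omit) shouldn't be taken as the exact DeepMind implementation:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """One block of an AlphaZero-style residual tower: two 3x3
    convolutions with batch norm, plus a skip connection."""
    def __init__(self, channels: int = 256):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + x)      # the skip connection

# A tower in the spirit of the paper's round numbers: an input convolution
# lifting the board planes to 256 channels, followed by a stack of blocks.
# (The real network adds separate policy and value heads on top.)
tower = nn.Sequential(
    nn.Conv2d(12, 256, 3, padding=1, bias=False),  # 12 planes, per my earlier sketch
    nn.BatchNorm2d(256),
    nn.ReLU(),
    *[ResidualBlock(256) for _ in range(40)],
)

x = torch.zeros(1, 12, 8, 8)      # one encoded position
print(tower(x).shape)             # torch.Size([1, 256, 8, 8])
```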

I update my previous entry on this today, not because I think I finally understand it all but, instead, because I don't. The new bits of understanding that I've come by, or maybe just think I've come by, suggest new and interesting ways to solve problems other than chess. Hopefully, over the next few weeks, I'll be able to follow through on this line of thinking and explore some more interesting, and far less speculative, ideas about using machine learning for solutions that are less academic and more accessible to the non-Google-funded hobbyist.

*Except that the networks train against an equally incompetent opponent. Which, if you think about it, is a fundamental flaw in the approach as I imagined it. The network would have no way to bootstrap itself into the realm of even a mildly-competent player.