Chess Piece Face

A couple of months after DeepMind announced their achievement with the AlphaZero chess-playing algorithm, I wrote an entry giving my own thinking on the matter. To a large extent, I wrote of my own self-adjustment from the previous generation of neural network technology to the then cutting-edge techniques which were used to create (arguably) the world’s best chess player. Some of the concepts were new to me, and I was trying to understand them even as I explained them in my own words. Other things were left vague in the summary paper, leaving me to fill in the blanks with guesses about how the programming team might have gone about their work.

It took about another year, but the team eventually released a fully-detailed paper explaining the AlphaZero architecture, even as tech-savvy amateurs (albeit folks far more competent than I) dissected the available information. Notably, this included the creation of open-source reimplementations of AlphaZero’s proprietary technology. Now, more than three years on, there are many basic-level explanations, using diagrams and examples, showing non-experts what AlphaZero does and how (for just one example, with diagram, see here).

I made a couple of serious mistakes in my previous analysis. Most importantly, I misunderstood the significance of AlphaZero’s “general-purpose Monte-Carlo tree search (MCTS) algorithm.” When I first read that, I interpreted it as simply the ability to randomly generate valid moves and thus to compile the list of possibilities which the neural network was to rank. An untrained neural net would rank its choices randomly and probably lose*, but could begin to learn how to win as better and better training sets were developed. That isn’t quite accurate. The better way to think about the AlphaZero solution is that it is fundamentally an MCTS algorithm, one which uses a neural net as a shortcut for estimating parameters where they have yet to be calculated. I’ll want to come back to this later, but in the meantime an article using tic-tac-toe to explain MCTS might be educational.
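To make that distinction concrete, here is a minimal sketch of a network-guided tree search. The Game and net interfaces (legal-move handling, predict) are hypothetical stand-ins of my own invention, not DeepMind’s actual code; the point is only that the network replaces the random playouts of classic MCTS.

```python
import math

class Node:
    def __init__(self, prior):
        self.prior = prior        # P(s,a): the policy head's guess for this move
        self.visits = 0           # N(s,a)
        self.value_sum = 0.0      # W(s,a)
        self.children = {}        # move -> Node

    def value(self):              # Q(s,a): mean result of simulations so far
        return self.value_sum / self.visits if self.visits else 0.0

def select_child(node, c_puct=1.5):
    # PUCT rule: trade off the network's prior against observed results.
    total = sum(child.visits for child in node.children.values())
    def score(child):
        exploration = c_puct * child.prior * math.sqrt(total + 1) / (1 + child.visits)
        return child.value() + exploration
    return max(node.children.items(), key=lambda kv: score(kv[1]))

def simulate(game, state, node, net):
    # Classic MCTS would play random moves to the end of the game here;
    # instead, the network supplies a value estimate at the leaf.
    if game.is_terminal(state):
        return game.result(state)
    if not node.children:                     # expand a new leaf
        policy, value = net.predict(state)    # hypothetical network call
        for move, prob in policy.items():
            node.children[move] = Node(prior=prob)
        return value
    move, child = select_child(node)
    value = -simulate(game, game.next_state(state, move), child, net)
    child.visits += 1
    child.value_sum += value
    return value
```

Run enough of these simulations and the visit counts at the root become the move ranking; an untrained network just makes the early searches bad, not impossible.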

My next error was in my understanding of convolutional networks. If you read my explanation, you might surmise that the source of my error was my outdated understanding of neural network technology in general. I imagined the convolutional layer as being a set of small neural networks. Further reading (a well illustrated example is here) about the use of convolution in image processing instructed me on the use of filters. This convolution strategy uses transformations that have been well defined through years of use in image manipulation and, in fact, often produce results familiar to those of us who play with photographs in the likes of Photoshop or GIMP. That a filter which helps sharpen images for the human eye would also help a machine intelligence with edge detection makes sense. In the broader sense, it is a simple method of encapsulating concepts like proximity in a way that does not require a neural net training process to learn about them on its own.
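For readers who haven’t seen one, here is what such a filter amounts to, sketched with NumPy. The kernel is the classic edge-detection filter from image editors; the naive double loop is my own illustration, not how any production library implements it.

```python
import numpy as np

def convolve2d(image, kernel):
    # Slide the kernel over the image; each output pixel is the
    # weighted sum of the patch of pixels underneath it.
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# The familiar 3x3 edge-detection kernel: large output where a pixel
# differs sharply from its neighbors, near zero in flat regions.
edge_kernel = np.array([[-1, -1, -1],
                        [-1,  8, -1],
                        [-1, -1, -1]])

image = np.random.rand(8, 8)        # stand-in for a grayscale image
edges = convolve2d(image, edge_kernel)
```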

I continue to believe that there is quite a bit of method, borne from extensive trial-and-error, in arranging the game state so that it makes sense, not only to the neural network, but also to those image processing filters. When the problem is identifying elements of an image, those real-life objects occupy distinct regions within the grid of pixels that make up that image. For the “concepts” about a chess board during a match, what is it that causes information to be co-located within a game-turn’s representation? It seems to me that the input arrays would have to be structured with an eye toward turning chess information into something that resembles a photograph. While the paper clearly states that the researchers DIDN’T find the results to depend upon data structure, it would seem this is a lot easier to get wrong than to get right.

I also note, assuming the illustration of AlphaZero’s architecture in the above paper is correct, that the neural network is extraordinarily complex. At the same time, it seems arbitrarily selected in advance. The dimensions of the architecture are nice round numbers – 256 convolution filters, 40 residual layers, etc. – which suggests to me one (or both) of two things. First, these are probably borrowed architectural decisions drawn from long experience with image-processing neural networks. Second, it may not have been necessary to “optimize” the architecture itself if the results were insensitive to the details. Of course, it’s also possible that when you’re drowning in excess computing power, you’d be more than happy to vastly oversize your neural network and then rely on the training process to trim the fat rather than invest the manpower to improve the architecture manually. Further, humans may make the right architectural decisions or the wrong ones. The network training will, at all times, attempt to make the objectively-best decision AND, perhaps more importantly, continually re-evaluate those decisions with each new run. For example, if you have enough computing power and training data, the fact that 252 out of those 256 filters are excluded every time may not be particularly concerning. Especially if, at some point, a new generation of the network decides that, after all, one of those excluded 252 really is important.
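For a sense of what those round numbers describe, here is a sketch of one residual block in PyTorch. This is the generic convolution/batch-norm/skip-connection pattern that such diagrams suggest; treat it as an illustration of the shape of the network, not a reproduction of DeepMind’s code.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, filters=256):
        super().__init__()
        self.conv1 = nn.Conv2d(filters, filters, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(filters)
        self.conv2 = nn.Conv2d(filters, filters, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(filters)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + x)   # the skip connection that makes it "residual"

# A tower with the dimensions quoted above: 40 blocks of 256 filters each.
tower = nn.Sequential(*[ResidualBlock(256) for _ in range(40)])
```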

I update my previous entry on this today, not because I think I finally understand it all but, instead, because I don’t. The new bits of understanding that I’ve come by, or maybe just think I’ve come by, suggest new and interesting ways to solve problems other than chess. Hopefully, over the next few weeks, I’ll be able to follow through with this line of thought and continue to explore some more interesting, and far less speculative, thoughts on using machine learning for solutions that are less academic and more accessible to the non-Google-funded hobbyist.

*Except that the networks train against an equally incompetent opponent. Which, if you think about it, is a fundamental flaw with my approach. The network would have no way to bootstrap itself into the realm of a mildly-competent player.

Artificial, but Intelligent? Part 2

I just finished reading Practical Game AI Programming: Unleash the power of Artificial Intelligence to your game. My conclusion is that there is a lot less to this book than meets the eye.

For someone thinking of purchasing this book, it would be difficult to weigh that decision before committing. The above link to Amazon has (as of this writing) no reviews, and I’ve not found any other independent evaluations of this work. Perhaps you could make a decision simply by studying the synopsis before you buy, and having done that, it is possible you’d be prepared for what the book offers. Having read the book, and then going back and reading the Amazon summary (which is taken from the publisher’s website), I find that it more or less describes the book’s content. In my case, I picked this book up as part of a Humble Book Bundle, so it was something of an impulse buy. I didn’t dig too hard into the description and instead worked my way through the chapters with my only expectations being based on the title.

Even applying the highest level of pre-purchase scrutiny only gets you so far. The description may indicate that the subject matter is of interest, but it is still a marketing pitch. It gives you no idea of the quality of either the information or the presentation. Furthermore, I think someone got a little carried away with the marketing hype. The description tosses out technical terms (e.g. the rete algorithm, forward chaining, pruning strategies) perhaps meant to dazzle the potential buyer with AI jargon. The problem is, these terms don’t even appear in the book, much less get demonstrated as a foundation for game programming. I feel that no matter how much upfront research you did before you bought, you’d come away feeling you got less than you bargained for.

What this book is not is an exploration of artificial intelligence as I have discussed that term previously on this website. This is not about machine learning or generic decision-making algorithms or (despite the buzzwords) rule engines. The book mentions applications like chess only in passing. Instead, the term “AI” is used as a gamer might use it. The book discusses a few tricks that a game programmer can use to make the supposedly-intelligent entities within a game appear to have, well, intelligence when encountered by the player.

The topic that it does cover has, in fact, some merit. The focus is mostly on simple algorithms and the minimal code required to create the impression of intelligent characters within a game. Some of the topics I found genuinely enlightening. The overarching emphasis on simplicity is also something that makes sense for programmers to aspire to. There is no need to program a character to have a complex motivation if you can, with only a few lines of code, program him to appear to have such complex motivation. It is just that I’m not sure these lessons qualify as “unleashing the power of Artificial Intelligence” by anyone’s definition.

But even before I got that far, my impression started off very poorly. The writing in this book is rather weak in terms of grammar, word usage, and content. In some cases, misused words become so jarring as to make it difficult to read through a page. Elsewhere, several absolutely meaningless sentences are strung together, perhaps acknowledging that a broader context is required but not knowing how to express it. At first, I didn’t think I was going to get very far into the book. After a chapter or so, however, reading became easier. Part of it may be my getting used to the “style,” if one can call it that. Part of it may also be that there is more “reaching” in the introductory and concluding sections but less when writing about concrete points.

I can’t say for sure but it is my guess, based on reading through the book, that the author does not use English as his primary language. I sometimes wondered if the text was written first in another language and then translated (or mistranslated, as the case may be) into English. Beyond that, the book also does not seem to have benefited from the work of a competent editor.

The structure of the chapters, for the most part, follows a pattern. A concept is introduced by showcasing some “classic” games with the desired behavior. Then some discussion of the principle is followed by a coding example, almost always in Unity’s C# development environment. This is often accompanied by screenshots of Unity’s graphics, either in development mode or at run-time. Most of the chapters, however, feel “padded.” Screenshots are sometimes repetitious. Presentation of the code is done incrementally, with each new addition requiring the re-printing of all of the sample code shown so far along with the new line or lines added in. By the end of the chapter, adding a concept might consist of two explanatory sentences, three screenshots, and two pages of sample code, 90% of which is identical to the sample code several pages earlier in the book. This is not efficient, and I don’t think it is useful. It does drive the page count way up.

I want to offer a caveat for my review. This is the first book I’ve read from this publisher. When reading about some of their other titles, it was explained that the books come with sample source code. If you buy the book directly from the publisher’s website (which I did not), the sample code is supposed to be downloaded along with the book text. If you buy from a third party, they provide a way to register your purchase on the publisher’s site to get access to the downloads. I did not try this. If this book does have downloaded samples that can be loaded into Unity, and those samples are well-done, that has a potential for adding significant value over the book on its own.

Back to the chapters. Going through them again, it feels like there is some “padding” to make the subject matter seem more extensive than it is. The book starts with two chapters on Finite State Machines (FSMs) and how that logic can be used to drive an “AI” character’s reactions to the player (a bare-bones sketch of the idea follows below). Then the book takes a detour into Unity’s support for a Finite State Machine implementation of animations, which gets its own chapter. This is mostly irrelevant to the subject of game AI and also, likely, of little value if you’re not using Unity.
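For anyone unfamiliar with the technique, this is roughly all a game-AI FSM amounts to. The states, names, and distance thresholds are my own invention, sketched in Python rather than the book’s Unity C#.

```python
from enum import Enum, auto

class State(Enum):
    PATROL = auto()
    CHASE = auto()
    ATTACK = auto()

def next_state(state, distance_to_player):
    # Each state checks only the transitions that matter to it; the
    # "intelligence" the player perceives is just these few rules.
    if state == State.PATROL and distance_to_player < 10:
        return State.CHASE                    # spotted the player
    if state == State.CHASE:
        if distance_to_player < 2:
            return State.ATTACK               # close enough to strike
        if distance_to_player > 15:
            return State.PATROL               # lost the player
    if state == State.ATTACK and distance_to_player >= 2:
        return State.CHASE                    # player backed away
    return state
```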

After the animation chapter, we head back into the AI world with a discussion of the A* pathfinding algorithm and its Theta* variant. This discussion is accompanied by a manual walkthrough of the algorithm on a simple square-grid 2D environment, describing each calculation and illustrating each step. I do appreciate the concrete example of the algorithm in action; many explanations of this topic I’ve found on-line simply show code or pseudo-code and leave it to the “student” to figure it all out. In this case, though, I think he managed to drive the page count up by an order of magnitude over what would have been sufficient to explain it clearly.
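For comparison, the whole algorithm fits comfortably on one page. This is a generic textbook A* on a square grid, written by me, not taken from the book:

```python
import heapq

def astar(grid, start, goal):
    # grid: 2D list where 0 = walkable, 1 = blocked; cells are (row, col).
    def h(a, b):                      # Manhattan-distance heuristic
        return abs(a[0] - b[0]) + abs(a[1] - b[1])
    open_set = [(h(start, goal), 0, start, None)]
    came_from, g_cost = {}, {start: 0}
    while open_set:
        _, g, current, parent = heapq.heappop(open_set)
        if current in came_from:      # already finalized via a shorter path
            continue
        came_from[current] = parent
        if current == goal:           # walk parents back to reconstruct the path
            path = []
            while current:
                path.append(current)
                current = came_from[current]
            return path[::-1]
        x, y = current
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if 0 <= nx < len(grid) and 0 <= ny < len(grid[0]) and not grid[nx][ny]:
                ng = g + 1
                if ng < g_cost.get((nx, ny), float("inf")):
                    g_cost[(nx, ny)] = ng
                    heapq.heappush(open_set,
                                   (ng + h((nx, ny), goal), ng, (nx, ny), current))
    return None                       # no path exists
```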

The final chapters show how Unity’s colliders and raycasting can be used to implement both collision avoidance and vision/detection systems. These are two very similar problems involving reacting to other objects in the environment that, themselves, can move around. As I said earlier, there are some useful concepts here, particularly in emphasizing a “keep it simple” design philosophy. If you can use configurable attributes on your development tool’s existing physics system to do something, that’s much preferable to generating your own code base. That goes double if the end user can’t tell one method from the other. However, I also get the feeling that I’m just being shown some pictures of simple Unity capabilities, rather than “unleashing the power of AI” in any meaningful sense.
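At its core, the “vision” half of this is a one-line geometric test plus a raycast. Here is the idea in Python (my own sketch; the book does the equivalent through Unity’s physics API):

```python
import math

def can_see(agent_pos, agent_facing, target_pos, fov_degrees=90, max_range=20):
    # A real system would also raycast against obstacles before confirming.
    dx, dy = target_pos[0] - agent_pos[0], target_pos[1] - agent_pos[1]
    dist = math.hypot(dx, dy)
    if dist > max_range:
        return False
    if dist == 0:
        return True                    # standing on top of the target
    fx, fy = agent_facing
    # Cosine of the angle between the facing direction and the target direction.
    cos_angle = (dx * fx + dy * fy) / (dist * math.hypot(fx, fy))
    return cos_angle >= math.cos(math.radians(fov_degrees / 2))
```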

A few years back, I was trying to solve a similar problem, but one that required being predictive about the intent of the other object. For example, if I want to plot an intercept vector to a moving target, and that target is not, itself, moving at a constant rate or direction, I need a good bit more math than the raycasting and colliders provide out of the box. Given the promise of this book’s subject matter, that might be a problem I’d expect to find, perhaps in the next chapter.
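Even the constant-velocity version of that problem already takes a quadratic solve, which is the kind of material I was hoping the book would reach. A sketch of that calculation (mine, under the stated constant-velocity assumption; a changing course means re-solving this each frame against a predicted motion):

```python
import math

def intercept_time(rel_pos, target_vel, pursuer_speed):
    # Solve |rel_pos + target_vel * t| = pursuer_speed * t for the
    # earliest t >= 0, i.e. when the pursuer can reach the target's
    # predicted position. Returns None if no intercept is possible.
    px, py = rel_pos
    vx, vy = target_vel
    a = vx * vx + vy * vy - pursuer_speed ** 2
    b = 2 * (px * vx + py * vy)
    c = px * px + py * py
    if abs(a) < 1e-9:                  # speeds match: equation is linear
        return -c / b if b < 0 else None
    disc = b * b - 4 * a * c
    if disc < 0:
        return None                    # target can never be caught
    t1 = (-b - math.sqrt(disc)) / (2 * a)
    t2 = (-b + math.sqrt(disc)) / (2 * a)
    candidates = [t for t in (t1, t2) if t > 0]
    return min(candidates) if candidates else None
```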

Alas, after discussing the problem of visual detection involving both direction and obstacles, the book calls an end to its journey. With the exception of the A* algorithm, the AI solutions consist almost entirely of Unity 3D geometry calls.

Although the book claims to be written in a way such that each chapter can be applied to a wide range of games, I feel like it narrows its focus as it progresses. The targeted game is – and I struggle with how to describe it, so I’ll just pick an example – the heirs to the DOOM legacy. By this, I mean games where the player progresses through a series of “levels” in order to complete the game. What the player encounters through those levels is imagined and created by the designer so as to construct the story of the game. The term AI, then, distinguishes between different kinds of encounters, at least as far as the player perceives them. For example, the player might find herself rushing across a bridge, which starts to collapse when she reaches the middle. This requires no “AI”; it is simply programmed in that, when the player reaches a certain point on the bridge, the game calls the collapseBridge routine. If she makes it past the bridge and into the next chamber, where there are a bunch of gremlins that want to do her in, the player starts considering the “AI” of those gremlins. Do they react to what the player does, adopting different tactics depending on her tactics? If so, she might praise the “AI.” By the book’s end, the focus is entirely on awareness of and reaction between mobile elements of a game which, by defining the problem that way, narrows things to a subset of even this category of games.

My harping on the narrow focus of this book goes to the determination of its value. If this book were free or very low cost, you would have to decide whether the poor use of English and the style detract from whatever useful information is presented. The problem with that is the price this book asks. The hardcopy (paperback) of the book is $50.00. The ebook is $31.19 on Amazon, discounted to $28 if you buy directly from the publisher’s site. All of those seem like a lot of money, per my budget. My own price, I figure, came to $7: I bought the $8 bundle package over the $1 package purely based on interest in this title. This is the first book in that set I’ve read, so if some of the others are good, I might consider the cost to be even lower. Still, even at $5, I feel like I’ve been cheated a bit by the content of this book.

The bundle contained other books from this same publisher, so I’ll plan to read at least one other before drawing any conclusions about their whole library. Even assuming that the quality of this book is, in fact, an outlier, it is still a risk to the publisher’s reputation. When one of your books is overpriced and oversold, the cautious buyer should assume that they are all overpriced and oversold. Looking at the publisher’s site, this book has nothing but positive reviews, and that kind of overselling is really a blemish on the publisher as a whole.

Although I won’t go so far as to say “I wish I hadn’t wasted the time I spent reading this,” I can’t imagine any purchaser for whom this title would be worth the money.

ABC Easy as 42

Teacher’s gonna show you how to get an ‘A’

In 1989, IBM hired a team of programmers out of Carnegie Mellon University. As part of his graduate program, team leader Feng-hsiung Hsu (aka Crazy Bird) had developed a system for computerized chess playing that the team called Deep Thought. Deep Thought, the (albeit fictional) original, was the computer created in Douglas Adams’s The Hitchhiker’s Guide to the Galaxy to compute the answer to Life, the Universe, and Everything. It was successful in determining that the answer was “42,” although it remained unknown what the question was. CMU’s Deep Thought, less ambitiously, was a custom-designed hardware-and-software solution for solving the problem of optimal chess playing.

Once at IBM, the project was renamed Deep Blue, with the “Blue” being a reference to IBM’s nickname of “Big Blue.”

On February 10th, 1996, Deep Blue won its first game against a chess World Champion, defeating Garry Kasparov. Kasparov would go on to win the match, but the inevitability of AI superiority was established.

Today, the ability of computer programs to defeat humans is no longer in question. While the game of chess may never be solved (à la checkers), it is understood that the best computer programs are superior players to the best human beings. Within the chess world, computer programs only make news for things like top players suspected of using programs to gain an unfair advantage in tournament play.

Nevertheless, a chess-playing computer was in the news late last year. Headlines reported that a chess-playing algorithm based on neural networks, starting only from the rules for legal chess moves, had in four hours trained itself into a player that could beat any human and nearly all top-ranked chess programs. The articles spread across the internet through various media outlets, each summary featuring its own set of distortions and simplifications. In particular, writers who had been pushing articles about the impending loss of jobs to AI and robots jumped on this as proof that the end had come. Fortunately, most linked to the original paper rather than trying to decipher the details.

Like most, I found this to be pretty intriguing news. Unfortunately, I also happen to know a little (just a little, really) about neural networks, and I didn’t even bother to read the whole paper before I started trying to figure out what had happened.

Some more background on this project: it was created at DeepMind, a subsidiary of Alphabet, Inc. That entity, formerly known simply as Google, reformed itself in the summer of 2015, with the new Google being one of many children of the Alphabet parent. Initial information suggested to me an attempt at creating one subsidiary for each letter of the alphabet, but time has shown that isn’t their direction. As of today, while there are many letters still open, several have multiple entries. Oh well, it sounded more fun my way. While naming a company “Alphabet” seems a bit uninspired, there is a certain logic to removing the name Google from the parent entity. No longer does one have to wonder why an internet company is developing self-driving cars.

Google’s self driving car?

The last time the world had an Artificial Intelligence craze was in the 1980s into the early 1990s. Neural networks were one of the popular machine intelligence techniques of that time, too. At first they seemed to offer the promise of a true intelligence: simply mimicking the structure of a biological brain could produce an ability to generalize intelligence, without people having to craft that intelligence in code. It was a program that could essentially teach itself. The applications for such systems seemed boundless.

Unfortunately, the optimism was quickly quashed. Neural networks had a number of flaws. First, they required huge amounts of “training” data. Neural nets work by finding relationships within data, but that source data has to be voluminous and it has to be suited to teaching the neural network. The inputs had to be properly chosen, so as to work well with the network’s manipulation of that data, and the data themselves had to be properly representative of the space being modeled. Furthermore, significant preprocessing was required from the person organizing the training. Additional inputs would result in exponential increases in both the training data requirement and the amount of processing time to run through the training.

It is worthwhile to recall the computing power available to neural net programmers of that time. Even a high-end server of 35 years ago is probably put to shame by the Xbox plugged into your television. Furthermore, the Xbox is better suited to the problem: the mathematics capability of Graphics Processing Units (GPUs) makes them a more efficient design for solving these kinds of matrix problems. Just as with Bitcoin mining, it is the GPU in a computer that is best able to handle neural network training.

To illustrate, let me consider briefly a “typical” neural network application of the previous generation: something called a “soft sensor.” Another innovation of that same time was the rapid expansion in capabilities of higher-level control systems for industrial processes. For example, some kind of factory-wide system could collect real-time data (temperatures, pressures, motor speeds – whatever is important) and present them in an organized fashion to give an overview of plant performance and, in some cases, automate overall plant control. For many systems, however, the full picture wasn’t always available in real time.

Let’s imagine the production of a product which has a specification limiting the amount of some impurity. Largely, we know what the right operating parameters of the system are (temperatures, pressures, etc.), but to actually measure for impurities, we manually draw a sample, send it off to a lab for testing, and wait a day or two for the result. It stands to reason that, in order to keep your product within spec, you must operate far enough away from the threshold that if the process begins to drift, you would usually have time to catch it before it goes out of spec. Not only does that mean you’re, most of the time, producing a product that exceeds specification (presumably at extra cost), but if the process ever moves faster than expected, you may have to trash a day’s worth of production created while you were waiting for lab results.

Enter the neural network and that soft sensor. We can create a database of the data that were collected in real time and correlate them with the matching sample analyses that became available afterward. Then a neural network can be trained, using the real-time measurements as input, to produce an output predicting the sample measurement. Assuming that the lab measurement is deducible from the on-line data, you now have in your automated control system (or even just as a presentation to the operators) a real-time “measurement” of data that otherwise wouldn’t be available until much later. Armed with that extra knowledge, you would expect to both cut operating costs (by operating tighter to specification) and prevent waste (by avoiding out-of-spec conditions before they happen).
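In today’s terms, the whole idea fits in a few lines. Here is a toy version with synthetic data and scikit-learn standing in for the plant and the period-appropriate tooling; the relationship between the process variables and the impurity is invented purely for the example.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 4))     # temperatures, pressures, motor speeds...
# Pretend the impurity depends nonlinearly on the process conditions,
# plus a little measurement noise from the lab.
y = (0.5 * X[:, 0] ** 2 + X[:, 1] - 0.3 * X[:, 2] * X[:, 3]
     + rng.normal(0, 0.05, 5000))

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
soft_sensor = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000,
                           random_state=0)
soft_sensor.fit(X_train, y_train)
print("R^2 on held-out data:", soft_sensor.score(X_test, y_test))
```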

That sounds very impressive, but I did use the word “assuming.” There were a lot of factors that had to come together before determining that a particular problem was solvable with neural networks. Obviously, the result you are trying to predict has to, indeed, be predictable from the data that you have. What this meant in practice is that implementing neural networks was much bigger than just the software project. It often meant redesigning your system to, for example, collect data on aspects of your operation that were never necessary for control, but are necessary for the predictive functioning of the neural net. You also need lots and lots of data. Operations that collected data slowly or inconsistently might not be capable of providing a data set suitable for training. Another gotcha was that collecting data from a system in operation probably meant that said system was already being controlled. Therefore, a neural net could just as easily be learning how your control system works, rather than the underlying fundamentals of your process. In fact, if your control reactions were consistent, that might be a much easier thing for the neural net to learn than the more subtle and variable physical process.

The result was that many applications weren’t suitable for neural networks and others required a lot of prep work. Projects might begin with redesigning the data collection system to get more and better data. Good data sets in hand, one was then forced into the time-intensive data analysis necessary to ensure a good training set. For example, it was often useful to pre-analyze the inputs to eliminate any dependent variables. Now, technically, that’s part of what the neural network should be good at – extracting the core dependencies from a complex system. However, the amount of effort – in data collected and training time – increases exponentially when you add inputs and hidden nodes, so simplifying a problem was well worth the effort. While it might seem like you can always just collect more data, remember that the data needed to be representative of the domain space. For example, if the condition that sends your process wandering off-spec only occurs once every three or four months, then doubling your complexity might mean (depending on your luck) increasing the data collection period from a month or two to over a year.

Hopefully you’ll excuse my trip down neural net memory lane, but I wanted to set your expectations of neural network technology where mine were, because the state of the art is very different from what it was. We’ve probably all seen some of the results with image recognition, which seems to be one of the hottest topics in neural networks these days.

So back to when I read the article. My first instinct was to reason in terms of the neural network technology as I was familiar with it.

My starting point in designing my own chess neural net has to be a representation of the board layout. If you know chess, you probably have a pretty good idea how to describe a chess board: you can describe each piece using a pretty concise terminology. For this purpose, I figure it is irrelevant where a piece has been; whether it started as a king’s knight’s pawn or a queen’s rook’s pawn doesn’t affect its performance. So you have 6 possible piece descriptors which need to be placed into the 64 squares that they could possibly reside upon. For example, imagine that I’m going to assign an integer to the pieces, and then use positive for white and negative for black:

Pawn = 1, Knight = 2, Bishop = 3, Rook = 4, King = 5, Queen = 6

My board might look something like this: 4,2,3,6,5,3,2,4,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0…-3,-2,-4
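Spelled out in code (my own sketch of the scheme above), the starting position would be built something like this:

```python
# One signed integer per square: positive for white, negative for
# black, zero for an empty square.
PIECE = {"pawn": 1, "knight": 2, "bishop": 3, "rook": 4, "king": 5, "queen": 6}

BACK_RANK = ["rook", "knight", "bishop", "queen", "king", "bishop", "knight", "rook"]

board = (
    [PIECE[p] for p in BACK_RANK] +      # white back rank: 4,2,3,6,5,3,2,4
    [PIECE["pawn"]] * 8 +                # white pawns
    [0] * 32 +                           # the empty middle of the board
    [-PIECE["pawn"]] * 8 +               # black pawns
    [-PIECE[p] for p in BACK_RANK]       # black back rank
)
assert len(board) == 64
```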

If I am still living in the 90s, I’m immediately going to be worried about the amount of data, and might wonder if I can compress the representation of my board based on assumptions about the starting positions. I’ve got all those zeros in the center of my matrix, and as the game progresses, I’m going to be getting fewer pieces and more zeros. Sixty-four inputs seems like a lot (double that to get current position and post-move position), and I might hope to winnow that down to some manageable figure with the kind of efforts that I talked about above.

If I hadn’t realized my problem already, I’d start to figure it out now. Neural networks like inputs to be proportional. Obviously, binary inputs are good – something either affects the prediction or doesn’t. But for variable inputs, the variation must make sense in terms of the problem you are solving. Using the power output of a pump as an input to a neural network makes sense. Using the model number of that pump, as an integer, wouldn’t make sense unless there is a happenstance relationship between the model number and some function meaningful to your process. Going back to my board description above, I could theoretically describe the “power” of my piece with a number between 1 and 10 (as an example), but any errors in my ability to accurately rank my pieces contribute to prediction errors. So is a Queen worth six times a pawn, or nine? Get that wrong, and my neural net training has an inaccuracy built in right up front. And, by the way, that means “worth” to the neural net, not to me or other human players.
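The standard dodge, then as now, is to stop encoding pieces as magnitudes at all. A one-hot encoding (my illustration, not anything from the paper) gives each piece type its own binary input and lets the network discover each type’s worth on its own:

```python
import numpy as np

PIECE_TYPES = ["pawn", "knight", "bishop", "rook", "king", "queen"]

def one_hot_square(piece, color):
    # 12 binary inputs per square: 6 piece types x 2 colors.
    # An empty square is all zeros; no fake "queen = 6x pawn" ordering.
    v = np.zeros(12)
    if piece is not None:
        v[PIECE_TYPES.index(piece) + (6 if color == "black" else 0)] = 1
    return v

print(one_hot_square("queen", "white"))   # [0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
```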

A much better way to represent a chess game to a mathematical “intelligence” is to describe the pieces. For example, each piece could be described with two inputs giving its deviation from that piece’s starting position in the X and Y axes, with perhaps a third node to indicate whether the piece has been captured. My starting board then becomes, by definition, 96 zeros, with numbers being populated (and generally growing) as the pieces move. It’s not terribly bigger (although rather horrifyingly so to my 90s self) than the representation by board, and I could easily get them on par by saying, for example, that captured pieces are moved elsewhere on the board, but well out of the 8×8 grid. Organizing by the pieces, though, is both non-intuitive for us human chess players and, in general, would seem less efficient in generalizing to other games. For example, if I’m modelling a card game (as I talked about in my previous post), describing every card and each of its possible positions is a much bigger data set than just describing what is in each hand and on the table. But, again, it should be clear that the description of the board is going to be considerably less meaningful as a mathematical entity than the description created by working from each game piece.
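As a sketch (again my own), the per-piece encoding looks like this; note that the untouched opening position really does come out as all zeros:

```python
def encode_piece(start_square, current_square):
    # Squares are (file, rank) pairs; current_square is None once captured.
    # Three numbers per piece: displacement in X and Y, plus a captured flag.
    if current_square is None:
        return (0, 0, 1)                  # off the board
    dx = current_square[0] - start_square[0]
    dy = current_square[1] - start_square[1]
    return (dx, dy, 0)

print(encode_piece((4, 1), (4, 1)))   # unmoved piece: (0, 0, 0)
print(encode_piece((4, 1), (4, 3)))   # king's pawn after 1. e4: (0, 2, 0)
print(encode_piece((4, 1), None))     # captured: (0, 0, 1)
```

Thirty-two pieces times three numbers gives the 96 zeros of the opening position.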

At this point, it is worth remembering again that this is no longer 1992. I briefly mentioned the advances in computing, both in power and in structure (the GPU architecture as superior for solving matrix math). That, in turn, has advanced the state of the art in neural network design and training. The combination goes a long way in explaining why image recognition is once again eyed as a problem for neural networks to address.

Consider the typical image. It is a huge number of pixels of (usually) highly-compressible data. But compressing the data will, as described above, befuddle the neural network. On the other hand, those huge, sparse matrices need representative training data to evenly cover the huge number of inputs, with that need increasing geometrically. It can quickly become, simply, too much of a problem to solve in a timely manner no matter what kind of computing power you’ve got to throw at it. But with that power, you can do new and interesting things. A solution for image recognition is to use “convolutional” networks.

Without trying to be too technically precise, I’ll try to capture the essence of this technique. The idea is that the input space can be broken up into sub-spaces (in an image, small fractions of the image) that then feed a significantly smaller neural network. Then, one might assume that those small networks are all the same as, or similar to, each other. For image recognition, we might train hundreds or even thousands of networks operating on 1% of the image (in overlapping segments), creating a (relatively) small output from the large number of pixels. Then those outputs feed a whole-image network. It is still a massive computational problem, but immensely smaller than the problem of training a network processing the entire image as the input.
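The trick that makes this tractable is weight sharing: those thousands of per-patch networks are literally the same few numbers reused at every position. A bare NumPy illustration (mine):

```python
import numpy as np

image = np.random.rand(100, 100)   # stand-in for 10,000 pixel intensities
filt = np.random.rand(3, 3)        # ONE shared set of 9 trainable weights

# The same 9 weights score every 3x3 patch of the image. By contrast, a
# dense layer over this input would need 10,000 weights per output node.
outputs = np.array([
    [np.sum(image[i:i + 3, j:j + 3] * filt) for j in range(98)]
    for i in range(98)
])
print(outputs.shape)               # (98, 98): one output per patch
```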

Does that make a chess problem solvable? It should help, especially if you have multiple convolutional layers. So there might be a neural network that describes each piece (6 inputs for old/new position (2D) plus on/off board) and reduces it to maybe 3 outputs. A second could map similar pieces: where are the bishops? Where are the pawns? Another sub-network, repeated twice, could look at just one player at a time. It is still a huge problem, but I can see that it’s something that becomes solvable given some time and effort.

Of course, this is Alphabet, Inc we are talking about. They’ve got endless supplies of (computing) time and (employee) effort, so if it is starting to look doable to a mere human like me, it is certainly doable for them.

At this point, I went back to the research paper, wherein I discovered that some of my intuition was right, although I didn’t fully appreciate that last point. Just as a simple example, the input layer for the DeepMind system represents the position as a stack of 8-by-8 binary planes, one plane per piece type per player, each showing where all pieces of that type sit. They also use a history of turns, not just the current and next turn. It is orders of magnitude more data than I anticipated, but in extremely sparse data sets. In fact, it looks very much like image processing, but with much more ordered images (to a computer mind, at least). The paper states they are using Tensor Processing Units, a Google concoction meant to use hardware with advantages similar to the GPU and its matrix-math specialization, but further optimized specifically to solve this kind of neural network training problem.
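A sketch of that plane-style encoding (the plane counts here are illustrative; the actual input adds further planes for details like castling rights and repetition):

```python
import numpy as np

PIECE_TYPES = ["pawn", "knight", "bishop", "rook", "queen", "king"]

def position_to_planes(pieces):
    # pieces: list of (piece_type, color, file, rank) tuples.
    # 12 binary 8x8 planes: 6 piece types for each of the two players.
    planes = np.zeros((12, 8, 8))
    for piece, color, f, r in pieces:
        idx = PIECE_TYPES.index(piece) + (6 if color == "black" else 0)
        planes[idx, r, f] = 1
    return planes

# Stacking 8 history steps gives a 96x8x8 input, before the extra
# metadata planes are appended.
history = [position_to_planes([("king", "white", 4, 0),
                               ("king", "black", 4, 7)])] * 8
stack = np.concatenate(history)    # shape (96, 8, 8)
print(stack.shape)
```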

So let’s finally go back to the claim that got all those singularity-is-nigh dreams dancing in the heads of internet commentators. The DeepMind team was able to train, in a matter of hours, a superhuman-level chess player with no a priori chess knowledge. Further, the paper states that the self-play games generating the training data were played using only 800 MCTS simulations per move (with games constrained only to legal moves), which seems like an incredibly thin basis for play. Even realizing how big those representations are (with their sparse descriptions of the piece locations as well as per-turn historical information), it all sounds awfully impressive. Of course, those games keep coming, iteration after iteration. If I’m reading right, that might be some 700k training iterations over about 9 hours using hardware nearly inconceivable to us mortals.

And that’s just the end result of a research project that took how long? Getting to the point where they could hit the “run” button certainly took months, and probably years.

First, you’ve got to come up with the data format, and the ability to generate games in that format. Surprisingly, the paper says that the exact representation wasn’t a significant factor; I suppose that is an advantage of its sparseness. Next, you’ve got to architect that neural net. How many convolutions over what subsets? How many layers? How many nodes? That’s a huge research project, and one that is going to need huge amounts of data – far beyond the self-play games generated for the final headline run.

The end result of all this – after a process involving a huge number of PhD hours and petaFLOPS of computational power – is a brain that can do one thing: learn about chess games. Yes, it is a brain without any knowledge in it – a tabula rasa – but it is a brain that is absolutely useless if provided knowledge about anything other than playing chess.

It’s still a fabulous achievement, no doubt. It is also research that is going to be useful to any number of AI learning projects going forward. But what it isn’t is any kind of demonstration that computers can out-perform people (or even mice, for that matter) in generic learning applications. It isn’t a demonstration that neural nets are being advanced into the area of general learning. This is not an Artificial Intelligence that could be, essentially, self-teaching and therefore life-like in terms of its capabilities.

And, just to let the press know, it isn’t the end of the world.