Original Paper: Dota 2 with Large Scale Deep Reinforcement Learning
Overall comments: this paper is like a cutting-edge tutorial in how to *structure and implement* an AI project/experiment. The key contribution is in the *way* they perform this massive experiment. The algorithms they use are fairly standard, so don't read it as a _scientific_ paper. This is almost purely an engineering paper, but splendidly so!
These phrases - "long time horizons", "partial observability" and "high dimensionality" - are described 5 paragraphs later, so don't worry if you don't understand them right now!
What they are essentially saying here is that, given the nature of the game, their main problem was not "what kind of network could we train to play this game?" but rather "how do we train our model long and hard enough for a game this big?" Theoretically, they already had algorithms that could conquer a game like Dota; the real problem was implementing a training system at that scale.
Ooh. Since their game environment *and* their code both kept changing from time to time, they would have had to keep throwing away the model they had trained. The way around this was to develop tools that could "save" what a model had learned and transfer it into a new model architecture. Wow! Now you don't have to throw everything away and restart from scratch! This is a pretty cool accomplishment.
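To make the "surgery" idea concrete, here is a minimal sketch of one operation it could involve: growing a layer while preserving the old model's behaviour. This is my own illustration in PyTorch, not OpenAI's actual tooling, and `widen_linear` is a hypothetical helper.

```python
import torch
import torch.nn as nn

def widen_linear(old: nn.Linear, new_out_features: int) -> nn.Linear:
    """Return a wider Linear layer whose first `old.out_features` outputs
    exactly reproduce the old layer; the extra units start at zero so the
    rest of the network can initially ignore them."""
    new = nn.Linear(old.in_features, new_out_features)
    with torch.no_grad():
        new.weight.zero_()
        new.bias.zero_()
        new.weight[: old.out_features] = old.weight
        new.bias[: old.out_features] = old.bias
    return new

# Hypothetical usage: grow a 512-unit hidden layer to 1024 units
old_layer = nn.Linear(256, 512)
new_layer = widen_linear(old_layer, 1024)
x = torch.randn(4, 256)
assert torch.allclose(old_layer(x), new_layer(x)[:, :512])
```

The real surgeries in the paper also cover changes to the observation space and other architecture details, but the guiding principle seems to be the same: initialize the new model so it starts out computing (approximately) the same function as the old one.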
Hmmm, it's not clear if they're trying to focus on challenges specific to Dota 2 compared to OTHER games, or challenges shared by ALL video games. If it's the former, then talking about only chess and Go doesn't make sense - like Aman mentioned, others have tried Atari, Doom, Capture the Flag, Starcraft and what not. If it's the latter, then comparing to ANY other game doesn't make sense.
For long time horizons, it's a little nonsensical to compare Dota 2 with chess and Go! Of course readers know that video games and board games are different. They should compare it with DeepMind's Atari work instead, which also involves video games.
Can anyone who actually plays Dota talk about the implications of the choice of heroes? In Appendix P (of the original paper), they say that expanding the hero pool led to much slower learning. They don't really give an explanation for why this happens.
Yep, acting only on every 4th timestep seems standard now - the original Atari paper by DeepMind also acted on every 4th frame.
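For reference, frame-skipping is usually implemented as a thin wrapper around the environment: the agent picks an action, the environment repeats it for the next few frames, and the rewards accumulate. A minimal sketch, assuming an old-style gym-like `step()` API (not the paper's actual code):

```python
class FrameSkip:
    """Repeat each chosen action for `skip` environment frames."""
    def __init__(self, env, skip=4):
        self.env = env
        self.skip = skip

    def reset(self):
        return self.env.reset()

    def step(self, action):
        total_reward, obs, done, info = 0.0, None, False, {}
        for _ in range(self.skip):
            obs, reward, done, info = self.env.step(action)
            total_reward += reward  # reward still counts on skipped frames
            if done:
                break
        return obs, total_reward, done, info
```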
Argh. Once again, it seems that experience playing Dota 2 would be very beneficial in reading this paper. I wonder how many AI researchers play video games that often. :)
About the claim that the discrepancies don't introduce bias when benchmarking against human players - during training that's definitely true. I'm going to nitpick a little here - during an actual match, wouldn't bypassing the visual understanding/computer vision steps give the computer a speed advantage that compounds over the course of the game?
DO READ Appendix G in the original paper! That's where they introduce one of the paper's signature contributions, the "TEAM SPIRIT" parameter, which they seem to tweak manually. Appendix G also shows how much reward an agent gets for different accomplishments.
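As I read Appendix G, "team spirit" (tau) blends each hero's own reward with the team-average reward. A rough sketch of that blending (my paraphrase of the formula, not OpenAI's code):

```python
def blend_team_spirit(solo_rewards, tau):
    """solo_rewards: per-hero rewards at one timestep; tau in [0, 1].
    tau = 0 -> every hero is purely selfish; tau = 1 -> the whole team
    shares a single averaged reward."""
    team_mean = sum(solo_rewards) / len(solo_rewards)
    return [(1.0 - tau) * r + tau * team_mean for r in solo_rewards]

# Only one hero got the kill, but with tau = 0.3 everyone feels some of it:
print(blend_team_spirit([1.0, 0.0, 0.0, 0.0, 0.0], tau=0.3))
```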
Around 770 PFlop/s-days of compute over 10 straight months! I'm curious how much it cost to train OpenAI Five. This article estimates that AlphaGo Zero (which is definitely much less intensive) alone cost over $30 million (it sounds crazy but not too far out there): https://www.yuzeh.com/data/agz-cost.html
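If anyone wants to play with a back-of-envelope estimate, here is the kind of arithmetic that article does, as a sketch. Every input below except the PFlop/s-days figure is an assumption I made up for illustration (peak GPU throughput, utilization, cloud price), and it ignores the enormous CPU rollout fleet entirely:

```python
# All inputs except PFLOPS_DAYS are illustrative assumptions, not facts from the paper.
PFLOPS_DAYS = 770          # optimization compute reported in the paper (PFlop/s-days)
GPU_PEAK_PFLOPS = 0.125    # assume ~125 TFLOP/s mixed-precision peak per GPU
UTILIZATION = 0.30         # assume only 30% of peak is actually sustained
PRICE_PER_GPU_HOUR = 3.0   # assume ~$3 per cloud GPU-hour

gpu_days = PFLOPS_DAYS / (GPU_PEAK_PFLOPS * UTILIZATION)
cost = gpu_days * 24 * PRICE_PER_GPU_HOUR
print(f"~{gpu_days:,.0f} GPU-days, very roughly ${cost:,.0f} (GPUs only)")
```

Under these made-up numbers you land somewhere around a million dollars for the GPU side alone; the real bill depends heavily on the CPU rollout workers and on how OpenAI actually provisioned hardware.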
So, like 40% of the online games that they won were abandoned by humans for whatever reason. But still, a 99.4% win rate is no joke.
Put simply, the point of surgery was to run experiments more quickly without losing much performance, *not* to produce the perfect model. Once they arrived at the final architecture and trained THAT from scratch, it went bazooka and even beat OpenAI Five! But if they had tried to retrain this bazooka from scratch *again and again* after every change, they estimate the whole project would have taken 4 times the resources. They could have explained it more simply, however.
The training samples only correspond to *small portions* of many different games, played with many different policies. Games are consumed by the optimizers in small chunks, and the policy generating them was constantly being updated.
Wait, so even while playing one game, the AI was evolving?
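That's my reading: rollout workers send short chunks of experience to the optimizers and periodically pull fresh parameters, even in the middle of a game. A rough sketch of that loop (my own simplification - names like `optimizer_queue` and `param_store` are hypothetical, not OpenAI's Rapid API):

```python
CHUNK_LEN = 256  # ship experience in short chunks, not whole games

def rollout_worker(env, policy, optimizer_queue, param_store):
    obs = env.reset()
    chunk = []
    while True:
        action = policy.act(obs)
        next_obs, reward, done = env.step(action)
        chunk.append((obs, action, reward, done))
        obs = env.reset() if done else next_obs

        if len(chunk) == CHUNK_LEN:
            optimizer_queue.put(chunk)         # send a partial-game slice
            chunk = []
            policy.load(param_store.latest())  # refresh weights mid-game
```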
For references and cited articles, please visit the original publication.