End-to-end reinforcement learning


In end-to-end reinforcement learning, the entire process from sensors to motors in a robot or agent is handled by a single layered or recurrent neural network without modularization, and the network is trained by reinforcement learning (RL). The approach has a long history, but it was reenergized by the successful results of Google DeepMind in learning to play Atari video games and in AlphaGo.
RL has traditionally required an explicit design of the state space and the action space, while only the mapping from states to actions is learned. RL has therefore been limited to learning the actions themselves: before learning, human designers must decide how the state space is constructed from the sensor signals and how motor commands are generated for each action. Neural networks have often been used in RL to provide non-linear function approximation and avoid the curse of dimensionality. Recurrent neural networks have also been employed, mainly to cope with perceptual aliasing in partially observable Markov decision processes.
End-to-end RL extends RL from learning only actions to learning the entire process from sensors to motors, including higher-level functions that are difficult to develop independently of the other functions. Because higher-level functions connect directly to neither the sensors nor the motors, even specifying their inputs and outputs is difficult.
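The core idea can be illustrated with a minimal sketch in Python using PyTorch. The network architecture, sensor dimension, and motor dimension below are illustrative assumptions rather than any particular published system; the point is only that a single network maps raw sensor signals directly to motor commands, with no hand-designed state or action representation in between.

```python
# Minimal sketch (hypothetical setup): one network from raw sensor readings
# to motor commands, trained as a whole by RL. Names and sizes are illustrative.
import torch
import torch.nn as nn

class EndToEndPolicy(nn.Module):
    def __init__(self, n_sensors=64, n_motors=4, hidden=128):
        super().__init__()
        # A single layered network from sensors to motors; a recurrent layer
        # could replace the hidden layer when the task requires memory
        # (partial observability).
        self.net = nn.Sequential(
            nn.Linear(n_sensors, hidden),
            nn.Tanh(),
            nn.Linear(hidden, n_motors),
            nn.Tanh(),  # motor commands bounded in [-1, 1]
        )

    def forward(self, raw_sensor_signal):
        return self.net(raw_sensor_signal)

policy = EndToEndPolicy()
motor_command = policy(torch.randn(1, 64))  # raw sensor vector in, motor command out
```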

History

The approach originated in TD-Gammon. In backgammon, the evaluation of the game situation during self-play was learned through temporal difference (TD) learning using a layered neural network. Four units encoded the number of pieces of a given color at a given location on the board, yielding 198 input signals in total. With zero knowledge built in, the network learned to play the game at an intermediate level.
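A minimal sketch of this kind of value learning is given below, assuming a generic 198-dimensional board encoding; TD-Gammon itself used TD(λ) with eligibility traces during self-play, which is simplified here to a plain TD(0) update.

```python
# Minimal sketch of temporal-difference value learning in the spirit of
# TD-Gammon, assuming a generic 198-dimensional board encoding; the real
# system used TD(lambda) with eligibility traces. Illustrative only.
import torch
import torch.nn as nn

value_net = nn.Sequential(              # layered network: board features -> win estimate
    nn.Linear(198, 40), nn.Sigmoid(),
    nn.Linear(40, 1), nn.Sigmoid(),
)
optimizer = torch.optim.SGD(value_net.parameters(), lr=0.1)

def td_update(board_t, board_t1, reward, gamma=1.0):
    """One TD(0) step: move V(s_t) toward reward + gamma * V(s_{t+1})."""
    with torch.no_grad():
        target = reward + gamma * value_net(board_t1)
    loss = (value_net(board_t) - target).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```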
Shibata and colleagues began working with this framework in 1997. They employed Q-learning and actor-critic methods for continuous motion tasks, and used a recurrent neural network for tasks that require memory. They applied the framework to several real-robot tasks and demonstrated the learning of various functions.
Beginning around 2013, Google DeepMind showed impressive learning results in video games and the game of Go. They used a deep convolutional neural network of the kind that had shown superior results in image recognition, taking four frames of almost raw pixels as input. The network was trained by RL, with the reward defined as the sign of the change in the game score. All 49 games were learned with the same network architecture and Q-learning with minimal prior knowledge; the system outperformed competing methods on almost all the games and played at a level comparable or superior to a professional human game tester. The approach is sometimes called the deep Q-network (DQN). In AlphaGo, deep neural networks were trained not only by reinforcement learning but also by supervised learning, and were combined with Monte Carlo tree search.
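The following is a minimal sketch of the Q-learning update described above, assuming four stacked 84×84 preprocessed frames and an illustrative action set; replay memory, the target network, and exploration are omitted, so this is a simplified illustration rather than the published DQN implementation.

```python
# Minimal sketch of a deep Q-network update: four stacked preprocessed frames
# go in, Q-values for each action come out, and the learning reward is only
# the sign of the score change. Sizes are illustrative, not the exact system.
import torch
import torch.nn as nn

n_actions = 18  # illustrative size of the Atari joystick action set

q_net = nn.Sequential(
    nn.Conv2d(4, 16, kernel_size=8, stride=4), nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
    nn.Flatten(),
    nn.Linear(32 * 9 * 9, 256), nn.ReLU(),
    nn.Linear(256, n_actions),
)
optimizer = torch.optim.RMSprop(q_net.parameters(), lr=2.5e-4)

def q_learning_step(frames, action, score_change, next_frames, done, gamma=0.99):
    """One Q-learning update on a batch of transitions (replay buffer omitted).
    `action` is a LongTensor of chosen action indices; `done` is 1.0 at episode end."""
    reward = torch.sign(score_change)  # reward is the sign of the score change
    with torch.no_grad():
        target = reward + gamma * q_net(next_frames).max(dim=1).values * (1.0 - done)
    q_sa = q_net(frames).gather(1, action.view(-1, 1)).squeeze(1)
    loss = nn.functional.smooth_l1_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```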

Function emergence

Shibata's group showed that various functions emerge in this framework, including:
Communication was also established in this framework. Modes include: