{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"# Report\n",
"\n",
"## Environment, State and Action Space\n",
"\n",
"In this environment, two agents control rackets to bounce a ball over a net. If an agent hits the ball over the net, it receives a reward of +0.1. If an agent lets a ball hit the ground or hits the ball out of bounds, it receives a reward of -0.01. Thus, the goal of each agent is to keep the ball in play.\n",
"The observation space consists of 8 variables corresponding to the position and velocity of the ball and racket. Each agent receives its own, local observation. Two continuous actions are available, corresponding to movement toward (or away from) the net, and jumping. \n",
"The task is episodic, and in order to solve the environment, your agents must get an average score of +0.5 (over 100 consecutive episodes, after taking the maximum over both agents). Specifically,\n",
"\t• After each episode, we add up the rewards that each agent received (without discounting), to get a score for each agent. This yields 2 (potentially different) scores. We then take the maximum of these 2 scores.\n",
"\t• This yields a single score for each episode.\n",
"The environment is considered solved, when the average (over 100 episodes) of those scores is at least +0.5.\n",
"\n",
"Vector Observation space type: continuous \n",
"Vector Observation space size (per agent): 8 \n",
"Number of stacked Vector Observation: 3 \n",
"Vector Action space type: continuous \n",
"Vector Action space size (per agent): 2 \n",
"Reward: +0.1 if agent hits the ball over the net or -0.1 if the ball hits the ground or out of bounds \n",
"\n",
"\n",
"\n",
"## Approach\n",
"\n",
"The approach is to use an Multi Agent Deep Deterministic Policy Gradient (MADDPG) algorithm, which is a model-free off-policy algorithm for learning continous actions. The algorithm combines ideas from DPG (Deterministic Policy Gradient) and DQN (Deep Q-Network). It uses Experience Replay and slow-learning target networks from DQN, and it is based on DPG, which can operate over continuous action spaces.\n",
"\n",
"The Multi Agent Deep Deterministic Policy Gradient (MADDPG) algorithm uses an Actor-Critic method:\n",
"classes \"Actor\" and \"Critic\", similarly the agents hyperparameters in the class \"Agent\".\n",
"\n",
"\n",
"\n",
"## Train the DDPG Agent and Optimization Parameters\n",
"\n",
"\n",
"### Neural Network parameters in Actor and Critic Model class\n",
"\n",
"Actor model structure: \n",
"Batch Normalization(states) \n",
"FC1 = Leaky ReLU activation (Neural Network(state size = 24, neuron units = 128))) \n",
"FC2 = Leaky ReLu activation (Neural Network(state size = FC1, neuron units = 128)) \n",
"FC3 = tanh (Neural Network(state size = FC2 = 128, neuron units = actions = 2)) \n",
" \n",
"Critic model structure: \n",
"Batch Normalization(states) \n",
"FC1 = Leaky ReLU activation (Neural Network(state size = 24, neuron units = 128))) \n",
"FC2 = ReLu activation (Neural Network(state size = FC1 + num_agents (=2) + actions (=2), neuron units = 128)) \n",
"FC3 = tanh (Neural Network(state size = FC2 = 128, neuron units = actions = 1)) \n",
"\n",
"\n",
"### Hyperparameters in Agent class\n",
"```python\n",
"BUFFER_SIZE = int(1e5) # replay buffer size\n",
"BATCH_SIZE = 128 # minibatch size\n",
"GAMMA = 0.99 # discount factor\n",
"TAU = 1e-3 # for soft update of target parameters\n",
"LR_ACTOR = 1e-4 # learning rate of the actor\n",
"LR_CRITIC = 1e-3 # learning rate of the critic\n",
"NUM_AGENTS = 2 # Multi agent approach = 2; single agent approach = 1\n",
"LEARNING_LOOPS = 2 # Perform a second learning step in the step function\n",
"```\n",
"\n",
"### Training Reward \n",
"Plot of the reward during the agents training. The agent learns slowely at the beginning, then increases quickly and finally solves the environment in episode 669.\n",
"![plot](plot.png \"Trained Agent\")\n",
"\n",
"See following the reward values:\n",
"```python\n",
"Episode 100\t Average Score: 0.02\n",
"Episode 200\t Average Score: 0.00\n",
"Episode 300\t Average Score: 0.00\n",
"Episode 400\t Average Score: 0.05\n",
"Episode 500\t Average Score: 0.09\n",
"Episode 600\t Average Score: 0.10\n",
"Episode 669\tmax score: 2.60\t average maximum score over the last 10 episodes: 1.36\n",
"Environment solved in 569 episodes!\t Average Score: 0.52\n",
"```\n",
"\n",
"Finally the trained model is applied to prove the agent to be smart. In this final test the agent scores in 1000 episode 1000 an average Score of 2.19!\n",
"\n",
"### Future Improvements\n",
"A logical improvement for the future would be implementing a training process with lots of additional agents. In comparison with the two agent approach, a high number of agents can improves the training process by parallelization of solving the problem. This should decrease the numbere of episodes needed to solve the environment drastically.\n",
"\n",
"In the approach, which I have choosen, one can still optimize the hyperparameters and the architecture of the neural network to improve training.\n",
"Besides DDPG, further algorithms such as PPO, D4PG, A3C an other shall be validated for improving training and the final accuracy.\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}