{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Collaboration and Competition\n",
"\n",
"---\n",
"\n",
"You are welcome to use this coding environment to train your agent for the project. Follow the instructions below to get started!\n",
"\n",
"### 1. Start the Environment\n",
"\n",
"Run the next code cell to install a few packages. This line will take a few minutes to run!"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[31mtensorflow 1.7.1 has requirement numpy>=1.13.3, but you'll have numpy 1.12.1 which is incompatible.\u001b[0m\r\n",
"\u001b[31mipython 6.5.0 has requirement prompt-toolkit<2.0.0,>=1.0.15, but you'll have prompt-toolkit 3.0.18 which is incompatible.\u001b[0m\r\n"
]
}
],
"source": [
"!pip -q install ./python"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The environment is already saved in the Workspace and can be accessed at the file path provided below. "
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"INFO:unityagents:\n",
"'Academy' started successfully!\n",
"Unity Academy name: Academy\n",
" Number of Brains: 1\n",
" Number of External Brains : 1\n",
" Lesson number : 0\n",
" Reset Parameters :\n",
"\t\t\n",
"Unity brain name: TennisBrain\n",
" Number of Visual Observations (per agent): 0\n",
" Vector Observation space type: continuous\n",
" Vector Observation space size (per agent): 8\n",
" Number of stacked Vector Observation: 3\n",
" Vector Action space type: continuous\n",
" Vector Action space size (per agent): 2\n",
" Vector Action descriptions: , \n"
]
}
],
"source": [
"from unityagents import UnityEnvironment\n",
"import numpy as np\n",
"\n",
"env = UnityEnvironment(file_name=\"/data/Tennis_Linux_NoVis/Tennis\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Environments contain **_brains_** which are responsible for deciding the actions of their associated agents. Here we check for the first brain available, and set it as the default brain we will be controlling from Python."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"# get the default brain\n",
"brain_name = env.brain_names[0]\n",
"brain = env.brains[brain_name]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2. Examine the State and Action Spaces\n",
"\n",
"Run the code cell below to print some information about the environment."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of agents: 2\n",
"Size of each action: 2\n",
"There are 2 agents. Each observes a state with length: 24\n",
"The state for the first agent looks like: [ 0. 0. 0. 0. 0. 0. 0.\n",
" 0. 0. 0. 0. 0. 0. 0.\n",
" 0. 0. -6.65278625 -1.5 -0. 0.\n",
" 6.83172083 6. -0. 0. ]\n",
"The state for the second agent looks like: [ 0. 0. 0. 0. 0. 0. 0.\n",
" 0. 0. 0. 0. 0. 0. 0.\n",
" 0. 0. -6.4669857 -1.5 0. 0.\n",
" -6.83172083 6. 0. 0. ]\n"
]
}
],
"source": [
"# reset the environment\n",
"env_info = env.reset(train_mode=True)[brain_name]\n",
"\n",
"# number of agents \n",
"NUM_AGENTS = len(env_info.agents)\n",
"print('Number of agents:', NUM_AGENTS)\n",
"\n",
"# size of each action\n",
"action_size = brain.vector_action_space_size\n",
"print('Size of each action:', action_size)\n",
"\n",
"# examine the state space \n",
"states = env_info.vector_observations\n",
"state_size = states.shape[1]\n",
"print('There are {} agents. Each observes a state with length: {}'.format(states.shape[0], state_size))\n",
"print('The state for the first agent looks like:', states[0])\n",
"print('The state for the second agent looks like:', states[1])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3. Take Random Actions in the Environment\n",
"\n",
"In the next code cell, you will learn how to use the Python API to control the agent and receive feedback from the environment.\n",
"\n",
"Note that **in this coding environment, you will not be able to watch the agents while they are training**, and you should set `train_mode=True` to restart the environment."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Total score (averaged over agents) this episode: -0.004999999888241291\n",
"Total score (averaged over agents) this episode: -0.004999999888241291\n",
"Total score (averaged over agents) this episode: -0.004999999888241291\n",
"Total score (averaged over agents) this episode: -0.004999999888241291\n",
"Total score (averaged over agents) this episode: -0.004999999888241291\n"
]
}
],
"source": [
"for i in range(5): # play game for 5 episodes\n",
" env_info = env.reset(train_mode=False)[brain_name] # reset the environment \n",
" states = env_info.vector_observations # get the current state (for each agent)\n",
" scores = np.zeros(NUM_AGENTS) # initialize the score (for each agent)\n",
" while True:\n",
" actions = np.random.randn(NUM_AGENTS, action_size) # select an action (for each agent)\n",
" actions = np.clip(actions, -1, 1) # all actions between -1 and 1\n",
" env_info = env.step(actions)[brain_name] # send all actions to tne environment\n",
" next_states = env_info.vector_observations # get next state (for each agent)\n",
" rewards = env_info.rewards # get reward (for each agent)\n",
" dones = env_info.local_done # see if episode finished\n",
" scores += env_info.rewards # update the score (for each agent)\n",
" states = next_states # roll over states to next time step\n",
" if np.any(dones): # exit loop if episode finished\n",
" break\n",
" print('Total score (averaged over agents) this episode: {}'.format(np.mean(scores)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When finished, you can close the environment."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"#env.close()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 4. It's Your Turn!\n",
"\n",
"Now it's your turn to train your own agent to solve the environment! A few **important notes**:\n",
"- When training the environment, set `train_mode=True`, so that the line for resetting the environment looks like the following:\n",
"```python\n",
"env_info = env.reset(train_mode=True)[brain_name]\n",
"```\n",
"- To structure your work, you're welcome to work directly in this Jupyter notebook, or you might like to start over with a new file! You can see the list of files in the workspace by clicking on **_Jupyter_** in the top left corner of the notebook.\n",
"- In this coding environment, you will not be able to watch the agents while they are training. However, **_after training the agents_**, you can download the saved model weights to watch the agents on your own machine! "
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"# Import necessary packages\n",
"\n",
"import gym\n",
"import random\n",
"import torch\n",
"import numpy as np\n",
"from collections import deque\n",
"import matplotlib.pyplot as plt\n",
"%matplotlib inline\n",
"\n",
"import workspace_utils"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Actor and Critic Model"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"# Actor and Critic Model\n",
"\n",
"import numpy as np\n",
"\n",
"import torch\n",
"import torch.nn as nn\n",
"import torch.nn.functional as F\n",
"\n",
"def hidden_init(layer):\n",
" fan_in = layer.weight.data.size()[0]\n",
" lim = 1. / np.sqrt(fan_in)\n",
" return (-lim, lim)\n",
"\n",
"class Actor(nn.Module):\n",
" \"\"\"Actor (Policy) Model.\"\"\"\n",
"\n",
" def __init__(self, state_size, action_size, seed, fc1_units=128, fc2_units=128):\n",
" \"\"\"Initialize parameters and build model.\n",
" Params\n",
" ======\n",
" state_size (int): Dimension of each state\n",
" action_size (int): Dimension of each action\n",
" seed (int): Random seed\n",
" fc1_units (int): Number of nodes in first hidden layer\n",
" fc2_units (int): Number of nodes in second hidden layer\n",
" \"\"\"\n",
" super(Actor, self).__init__()\n",
" self.seed = torch.manual_seed(seed)\n",
" self.bn1 = nn.BatchNorm1d(state_size)\n",
" self.fc1 = nn.Linear(state_size, fc1_units) # we have two agents in this environment\n",
" self.fc2 = nn.Linear(fc1_units, fc2_units)\n",
" self.fc3 = nn.Linear(fc2_units, action_size)\n",
" self.bn2 = nn.BatchNorm1d(fc2_units)\n",
" self.reset_parameters()\n",
"\n",
" def reset_parameters(self):\n",
" self.fc1.weight.data.uniform_(*hidden_init(self.fc1))\n",
" self.fc2.weight.data.uniform_(*hidden_init(self.fc2))\n",
" self.fc3.weight.data.uniform_(-3e-3, 3e-3)\n",
"\n",
" def forward(self, state):\n",
" \"\"\"Build an actor (policy) network that maps states -> actions.\"\"\" \n",
" x = self.bn1(state)\n",
" x = F.leaky_relu(self.fc1(x)) # F.relu\n",
" x = F.leaky_relu(self.fc2(x)) # F.relu\n",
" return F.tanh(self.fc3(x))\n",
"\n",
"\n",
"class Critic(nn.Module):\n",
" \"\"\"Critic (Value) Model.\"\"\"\n",
"\n",
" def __init__(self, state_size, action_size, seed, fcs1_units=128, fc2_units=128):\n",
" \"\"\"Initialize parameters and build model.\n",
" Params\n",
" ======\n",
" state_size (int): Dimension of each state\n",
" action_size (int): Dimension of each action\n",
" seed (int): Random seed\n",
" fcs1_units (int): Number of nodes in the first hidden layer\n",
" fc2_units (int): Number of nodes in the second hidden layer\n",
" \"\"\"\n",
" super(Critic, self).__init__()\n",
" self.seed = torch.manual_seed(seed)\n",
" \n",
" self.bn1 = nn.BatchNorm1d(state_size) \n",
" self.fcs1 = nn.Linear(state_size, fcs1_units) # take in the state size\n",
" self.fc2 = nn.Linear((fcs1_units+NUM_AGENTS*action_size), fc2_units) # add actions\n",
" self.fc3 = nn.Linear(fc2_units, 1) # 1 = a single value\n",
" self.bn2 = nn.BatchNorm1d(fc2_units)\n",
" self.reset_parameters()\n",
"\n",
" def reset_parameters(self):\n",
" self.fcs1.weight.data.uniform_(*hidden_init(self.fcs1))\n",
" self.fc2.weight.data.uniform_(*hidden_init(self.fc2))\n",
" self.fc3.weight.data.uniform_(-3e-3, 3e-3)\n",
"\n",
" def forward(self, state, action_agent_0, action_agent_1):\n",
" \"\"\"Build a critic (value) network that maps (state, action) pairs -> Q-values.\"\"\"\n",
" x = self.bn1(state)\n",
" x = F.leaky_relu(self.fcs1(x)) # F.relu\n",
" x = torch.cat((x, action_agent_0, action_agent_1), dim=1)\n",
" x = F.leaky_relu(self.fc2(x)) # F.relu\n",
" return self.fc3(x)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Multi Agent Deep Determinitic Policy Gradient Agent with Replay Buffer and Noise"
]
},
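{
"cell_type": "markdown",
"metadata": {},
"source": [
"The update rules described in the docstrings of the agent code below can be summarised as two equations. This is a sketch in notation introduced here for illustration: $\\mu$ denotes the actor, $Q$ the critic, $s'$ the next state, $s'_{\\text{other}}$ the other player's next state and $d$ the done flag.\n",
"\n",
"$$y = r + \\gamma \\, (1 - d) \\, Q_{\\text{target}}\\big(s', \\mu_{\\text{target}}(s'), \\mu_{\\text{target}}(s'_{\\text{other}})\\big)$$\n",
"\n",
"$$\\theta_{\\text{target}} \\leftarrow \\tau \\, \\theta_{\\text{local}} + (1 - \\tau) \\, \\theta_{\\text{target}}$$\n",
"\n",
"The critic minimises the mean squared error between $Q_{\\text{local}}(s, a, a_{\\text{other}})$ and $y$, with $\\gamma$ = `GAMMA` and $\\tau$ = `TAU` as set in the code."
]
},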
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"# Multi Agent Deep Determinitic Policy Gradient Agent with Replay Buffer and Noise\n",
"\n",
"import numpy as np\n",
"import random\n",
"import copy\n",
"from collections import namedtuple, deque\n",
"\n",
"import torch\n",
"import torch.nn.functional as F\n",
"import torch.optim as optim\n",
"\n",
"BUFFER_SIZE = int(1e5) # replay buffer size\n",
"BATCH_SIZE = 128 # minibatch size\n",
"GAMMA = 0.99 # discount factor\n",
"TAU = 1e-3 # for soft update of target parameters\n",
"LR_ACTOR = 1e-4 # learning rate of the actor\n",
"LR_CRITIC = 1e-3 # learning rate of the critic\n",
"NUM_AGENTS = 2 # Multi agent approach = 2; single agent approach = 1\n",
"LEARNING_LOOPS = 2 # Perform a second learning step in the step function\n",
"\n",
"device = torch.device(\"cuda:0\" if torch.cuda.is_available() else \"cpu\")\n",
"\n",
"\n",
"class Agent(object):\n",
" \"\"\"Interacts with and learns from the environment.\"\"\"\n",
"\n",
" def __init__(self, state_size, action_size, random_seed):\n",
" \"\"\"Initialize an Agent object.\n",
" Params\n",
" ======\n",
" state_size (int): dimension of each state\n",
" action_size (int): dimension of each action\n",
" random_seed (int): random seed\n",
" \"\"\"\n",
" self.state_size = state_size\n",
" self.action_size = action_size\n",
" self.seed = random.seed(random_seed)\n",
"\n",
" # Actor Network (w/ Target Network)\n",
" self.actor_local = Actor(state_size, action_size, random_seed).to(device)\n",
" self.actor_target = Actor(state_size, action_size, random_seed).to(device)\n",
" self.actor_optimizer = optim.Adam(self.actor_local.parameters(), lr=LR_ACTOR)\n",
" print(\"Actor net - local: \", self.actor_local)\n",
" print(\"Actor net - target: \", self.actor_target)\n",
"\n",
" # Critic Network (w/ Target Network)\n",
" self.critic_local = Critic(state_size, action_size, random_seed).to(device)\n",
" self.critic_target = Critic(state_size, action_size, random_seed).to(device)\n",
" self.critic_optimizer = optim.Adam(self.critic_local.parameters(), lr=LR_CRITIC)\n",
" print(\"Critic net - local: \", self.critic_local)\n",
" print(\"Critic net - target: \", self.critic_target)\n",
" \n",
" # Init the weights of the target network with the weights of the local network\n",
" self.init_soft_update(self.actor_local, self.actor_target)\n",
" self.init_soft_update(self.critic_local, self.critic_target)\n",
"\n",
" # Noise process\n",
" self.noise = OUNoise(NUM_AGENTS * action_size, random_seed)\n",
"\n",
" # Replay memory\n",
" self.memory = ReplayBuffer(BUFFER_SIZE, BATCH_SIZE, random_seed)\n",
"\n",
" def step(self, states, actions_agent_0, actions_agent_1, rewards, next_state_agent_0, next_state_agent_1, dones):\n",
" \"\"\"Save experience in replay memory, and use random sample from buffer to learn.\"\"\"\n",
" # Save experiences / rewards\n",
" for state, action, action_1, reward, next_state, next_state_1, done \\\n",
" in zip(states, actions_agent_0, actions_agent_1, rewards, next_state_agent_0, next_state_agent_0, dones):\n",
" self.memory.add(state, action, action_1, reward, next_state, next_state_1, done)\n",
"\n",
" # Learn, if enough samples are available in memory\n",
" # Learn x times as specified in LEARNING_LOOPS\n",
" if len(self.memory) > BATCH_SIZE:\n",
" for _ in range(LEARNING_LOOPS):\n",
" experiences = self.memory.sample()\n",
" self.learn(experiences, GAMMA)\n",
"\n",
" def act(self, state, add_noise=True, noise_factor=1.0):\n",
" \"\"\"Returns actions for given state as per current policy.\"\"\"\n",
" state = torch.from_numpy(state).float().to(device)\n",
" self.actor_local.eval()\n",
" with torch.no_grad():\n",
" action = self.actor_local(state).cpu().data.numpy()\n",
" self.actor_local.train()\n",
" if add_noise:\n",
" action += noise_factor * self.noise.sample().reshape((-1, 2))\n",
" return np.clip(action, -1, 1)\n",
"\n",
" def reset(self):\n",
" self.noise.reset()\n",
"\n",
" def learn(self, experiences, gamma):\n",
" \"\"\"Update policy and value parameters using given batch of experience tuples.\n",
" Q_targets = r + γ * critic_target(next_state, actor_target(next_state), actor_target(next_state_other_player))\n",
" where:\n",
" actor_target(state) -> action\n",
" critic_target(state, action, action) -> Q-value\n",
" Params\n",
" ======\n",
" experiences (Tuple[torch.Tensor]): tuple of (s, a, a_2, r, s', s'_2, done) tuples\n",
" gamma (float): discount factor\n",
" \"\"\"\n",
" ##states, actions, actions_other_player, rewards, next_states, next_states_other_player, dones = experiences\n",
" states, actions_agent_0, actions_agent_1, rewards, next_state_agent_0, next_state_agent_1, dones = experiences\n",
"\n",
" # ---------------------------- update critic ---------------------------- #\n",
" # Get predicted next-state actions and Q values from target models\n",
" actions_next_agent_0 = self.actor_target(next_state_agent_0)\n",
" actions_next_agent_1 = self.actor_target(next_state_agent_1)\n",
" Q_targets_next = self.critic_target(next_state_agent_0, actions_next_agent_0, actions_next_agent_1)\n",
" # Compute Q targets for current states (y_i)\n",
" Q_targets = rewards + (gamma * Q_targets_next * (1 - dones))\n",
" # Current expected Q-values\n",
" Q_expected = self.critic_local(states, actions_agent_0, actions_agent_1)\n",
" # Compute critic loss\n",
" critic_loss = F.mse_loss(Q_expected, Q_targets)\n",
" # Minimize the loss\n",
" self.critic_optimizer.zero_grad()\n",
" critic_loss.backward()\n",
" # Gradient clipping\n",
" torch.nn.utils.clip_grad_norm_(self.critic_local.parameters(), 1)\n",
" self.critic_optimizer.step()\n",
"\n",
" # ---------------------------- update actor ---------------------------- #\n",
" # Compute actor loss\n",
" actions_pred = self.actor_local(states)\n",
" actor_loss = -self.critic_local(states, actions_pred, actions_agent_1).mean()\n",
" # Minimize the loss\n",
" self.actor_optimizer.zero_grad()\n",
" actor_loss.backward()\n",
" self.actor_optimizer.step()\n",
"\n",
" # ----------------------- update target networks ----------------------- #\n",
" self.soft_update(self.critic_local, self.critic_target, TAU)\n",
" self.soft_update(self.actor_local, self.actor_target, TAU)\n",
"\n",
" def soft_update(self, local_model, target_model, tau):\n",
" \"\"\"Soft update model parameters.\n",
" θ_target = τ*θ_local + (1 - τ)*θ_target\n",
" Params\n",
" ======\n",
" local_model: PyTorch model (weights will be copied from)\n",
" target_model: PyTorch model (weights will be copied to)\n",
" tau (float): interpolation parameter\n",
" \"\"\"\n",
" for target_param, local_param in zip(target_model.parameters(), local_model.parameters()):\n",
" target_param.data.copy_(tau * local_param.data + (1.0 - tau) * target_param.data)\n",
"\n",
" def init_soft_update(self, local_net, target_net):\n",
" \"\"\" Init model parameters \n",
" Params\n",
" ======\n",
" local_model: PyTorch model (weights will be copied from)\n",
" target_model: PyTorch model (weights will be copied to)\n",
" \"\"\"\n",
" for target_param, local_param in zip(target_net.parameters(), local_net.parameters()):\n",
" target_param.data.copy_(local_param.data) \n",
" \n",
"\n",
"class OUNoise:\n",
" \"\"\"Ornstein-Uhlenbeck process.\"\"\"\n",
"\n",
" def __init__(self, size, seed, mu=0., theta=0.15, sigma=0.2):\n",
" \"\"\"Initialize parameters and noise process.\"\"\"\n",
" self.mu = mu * np.ones(size)\n",
" self.theta = theta\n",
" self.sigma = sigma\n",
" self.seed = random.seed(seed)\n",
" self.state = None\n",
" self.reset()\n",
"\n",
" def reset(self):\n",
" \"\"\"Reset the internal state (= noise) to mean (mu).\"\"\"\n",
" self.state = copy.copy(self.mu)\n",
"\n",
" def sample(self):\n",
" \"\"\"Update internal state and return it as a noise sample.\"\"\"\n",
" x = self.state\n",
" dx = self.theta * (self.mu - x) + self.sigma * np.array([random.random() for i in range(len(x))])\n",
" self.state = x + dx\n",
" return self.state\n",
"\n",
"\n",
"class ReplayBuffer:\n",
" \"\"\"Fixed-size buffer to store experience tuples.\"\"\"\n",
" def __init__(self, buffer_size, batch_size, seed):\n",
" \"\"\"Initialize a ReplayBuffer object.\n",
" Params\n",
" ======\n",
" buffer_size (int): maximum size of buffer\n",
" batch_size (int): size of each training batch\n",
" \"\"\"\n",
" self.memory = deque(maxlen=buffer_size) # internal memory (deque)\n",
" self.batch_size = batch_size\n",
" self.experience = namedtuple(\"Experience\", field_names=[\"state\", \"action\", \"action_1\", \"reward\", \"next_state\", \"next_state_1\", \"done\"])\n",
" self.seed = random.seed(seed)\n",
"\n",
" def add(self, state, action, action_1, reward, next_state, next_state_1, done):\n",
" \"\"\"Add a new experience to memory.\"\"\"\n",
" e = self.experience(state, action, action_1, reward, next_state, next_state_1, done)\n",
" self.memory.append(e)\n",
"\n",
" def sample(self):\n",
" \"\"\"Randomly sample a batch of experiences from memory.\"\"\"\n",
" experiences = random.sample(self.memory, k=self.batch_size)\n",
"\n",
" states = torch.from_numpy(np.vstack([e.state for e in experiences if e is not None])).float().to(device)\n",
" actions = torch.from_numpy(np.vstack([e.action for e in experiences if e is not None])).float().to(device)\n",
" actions_1 = torch.from_numpy(np.vstack([e.action_1 for e in experiences if e is not None])).float().to(device)\n",
" rewards = torch.from_numpy(np.vstack([e.reward for e in experiences if e is not None])).float().to(device)\n",
" next_states = torch.from_numpy(np.vstack([e.next_state for e in experiences if e is not None])).float().to(\n",
" device)\n",
" next_states_1 = torch.from_numpy(np.vstack([e.next_state_1 for e in experiences if e is not None])).float().to(\n",
" device)\n",
" dones = torch.from_numpy(np.vstack([e.done for e in experiences if e is not None]).astype(np.uint8)).float().to(\n",
" device)\n",
"\n",
" return states, actions, actions_1, rewards, next_states, next_states_1, dones\n",
"\n",
" def __len__(self):\n",
" \"\"\"Return the current size of internal memory.\"\"\"\n",
" return len(self.memory)"
]
},
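{
"cell_type": "markdown",
"metadata": {},
"source": [
"The noise update in `OUNoise.sample` above can be written out as follows. This is a sketch of the code as written, with $\\varepsilon_t$ standing for the `random.random()` draw (notation introduced here):\n",
"\n",
"$$x_{t+1} = x_t + \\theta \\, (\\mu - x_t) + \\sigma \\, \\varepsilon_t, \\qquad \\varepsilon_t \\sim U[0, 1)$$\n",
"\n",
"Because $\\varepsilon_t$ has mean $\\tfrac{1}{2}$ rather than zero, the expected noise settles where $\\theta \\, (\\mu - \\bar{x}) + \\tfrac{\\sigma}{2} = 0$:\n",
"\n",
"$$\\bar{x} = \\mu + \\frac{\\sigma}{2\\theta} = 0 + \\frac{0.2}{2 \\cdot 0.15} \\approx 0.67$$\n",
"\n",
"With the default parameters the exploration noise is therefore biased towards positive actions until `noise_factor` has decayed; a zero-mean draw such as `np.random.standard_normal` would remove that bias."
]
},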
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Training of the agent (DDPG)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"num_agents: 2\n",
"state_size: 24\n",
"action_size: 2\n",
"Actor net - local: Actor(\n",
" (bn1): BatchNorm1d(24, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n",
" (fc1): Linear(in_features=24, out_features=128, bias=True)\n",
" (fc2): Linear(in_features=128, out_features=128, bias=True)\n",
" (fc3): Linear(in_features=128, out_features=2, bias=True)\n",
" (bn2): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n",
")\n",
"Actor net - target: Actor(\n",
" (bn1): BatchNorm1d(24, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n",
" (fc1): Linear(in_features=24, out_features=128, bias=True)\n",
" (fc2): Linear(in_features=128, out_features=128, bias=True)\n",
" (fc3): Linear(in_features=128, out_features=2, bias=True)\n",
" (bn2): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n",
")\n",
"Critic net - local: Critic(\n",
" (bn1): BatchNorm1d(24, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n",
" (fcs1): Linear(in_features=24, out_features=128, bias=True)\n",
" (fc2): Linear(in_features=132, out_features=128, bias=True)\n",
" (fc3): Linear(in_features=128, out_features=1, bias=True)\n",
" (bn2): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n",
")\n",
"Critic net - target: Critic(\n",
" (bn1): BatchNorm1d(24, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n",
" (fcs1): Linear(in_features=24, out_features=128, bias=True)\n",
" (fc2): Linear(in_features=132, out_features=128, bias=True)\n",
" (fc3): Linear(in_features=128, out_features=1, bias=True)\n",
" (bn2): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n",
")\n",
"Episode 100\t Average Score: 0.02rage maximum score over the last 10 episodes: 0.02\n",
"Episode 200\t Average Score: 0.00rage maximum score over the last 10 episodes: 0.00\n",
"Episode 300\t Average Score: 0.00rage maximum score over the last 10 episodes: 0.02\n",
"Episode 400\t Average Score: 0.05rage maximum score over the last 10 episodes: 0.10\n",
"Episode 500\t Average Score: 0.09rage maximum score over the last 10 episodes: 0.10\n",
"Episode 600\t Average Score: 0.10rage maximum score over the last 10 episodes: 0.10\n",
"Episode 669\tmax score: 2.60\t average maximum score over the last 10 episodes: 1.36\n",
"Environment solved in 569 episodes!\t Average Score: 0.52\n"
]
}
],
"source": [
"# Training of the agent (DDPG)\n",
"\n",
"def maddpg(n_episodes=1000, max_t=10000, print_every=100):\n",
" \"\"\"Deep Deterministic Policy Gradients DDPG\n",
" Params\n",
" ======\n",
" n_episodes (int) = maximum number of episodes\n",
" max_t (int) = max number of timesteps per episode\n",
" print_every (int) = orint results every n episodes\n",
" \"\"\"\n",
" scores_window = deque(maxlen=print_every)\n",
" scores = []\n",
" scores_mean = []\n",
" noise_decay = 1.0 # noise multiplication factor\n",
" \n",
" for i_episode in range(1, n_episodes+1):\n",
" env_info = env.reset(train_mode=True)[brain_name] # reset env\n",
" states = env_info.vector_observations # get current states of agent\n",
" score = np.zeros(NUM_AGENTS) # set score/ reward to zero if using multiple agents\n",
" agent.reset()\n",
" \n",
" for t in range(max_t):\n",
" if i_episode < 100:\n",
" actions_agent_0 = np.random.randn(2, 2) # use random actions for the first 100 episodes\n",
" else:\n",
" actions_agent_0 = agent.act(states, noise_factor=noise_decay) # let the agent select actions\n",
" \n",
" actions_agent_1 = np.flip(actions_agent_0, 0) # action of second agent/player\n",
" env_info = env.step(actions_agent_0)[brain_name] # send actions of both agents to env\n",
" rewards = env_info.rewards # get the rewards from env\n",
" next_state_agent_0 = env_info.vector_observations # get next states from env\n",
" next_state_agent_1 = np.flip(next_state_agent_0, 0) # get the resulting states for the second agent/player\n",
" dones = env_info.local_done # check if episode is done\n",
" \n",
" agent.step(states, actions_agent_0, actions_agent_1, rewards, next_state_agent_0, next_state_agent_1, dones) # perform step of agent\n",
" \n",
" # update statistical variables\n",
" states = next_state_agent_0\n",
" score += rewards\n",
" if np.any(dones):\n",
" break\n",
" \n",
" score_max = np.max(score) # the max score of the agents\n",
" scores_window.append(score_max)\n",
" scores.append(score_max)\n",
" scores_mean.append(np.mean(scores_window)) # mean score of the agent\n",
" \n",
" noise_decay = max(0.999 * noise_decay, 0.01) # reduce noise during training\n",
" \n",
" print('\\rEpisode {:d}\\tmax score: {:.2f}\\t average maximum score over the last 10 episodes: {:.2f}'.format(i_episode, scores_window[-1], np.mean(list(scores_window)[-10:])), end=\"\")\n",
" \n",
" if i_episode > 100 and np.mean(scores_window) > 0.5:\n",
" torch.save(agent.actor_local.state_dict(), 'checkpoint_actor.pth')\n",
" torch.save(agent.critic_local.state_dict(), 'checkpoint_critic.pth')\n",
" print('\\nEnvironment solved in {:d} episodes!\\t Average Score: {:.2f}'.format(i_episode-100, np.mean(scores_window)))\n",
" break\n",
" if i_episode % print_every == 0:\n",
" print('\\rEpisode {}\\t Average Score: {:.2f}'.format(i_episode, np.mean(scores_window)))\n",
" \n",
" return scores, scores_mean\n",
"\n",
"print(\"num_agents: \", NUM_AGENTS)\n",
"print(\"state_size: \", state_size)\n",
"print(\"action_size: \", action_size)\n",
"with workspace_utils.active_session():\n",
" agent = Agent(state_size=24, action_size=2, random_seed=0)\n",
" scores, scores_mean = maddpg()\n"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<matplotlib.figure.Figure at 0x7f4484d544a8>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"# plot the scores\n",
"fig = plt.figure()\n",
"ax = fig.add_subplot(111)\n",
"plt.plot(np.arange(len(scores)), scores, label='Score')\n",
"plt.plot(np.arange(len(scores_mean)), scores_mean, label='Mean')\n",
"plt.ylabel('Score')\n",
"plt.xlabel('Episode')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Watch a Smart Agent"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Actor net - local: Actor(\n",
" (bn1): BatchNorm1d(24, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n",
" (fc1): Linear(in_features=24, out_features=128, bias=True)\n",
" (fc2): Linear(in_features=128, out_features=128, bias=True)\n",
" (fc3): Linear(in_features=128, out_features=2, bias=True)\n",
" (bn2): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n",
")\n",
"Actor net - target: Actor(\n",
" (bn1): BatchNorm1d(24, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n",
" (fc1): Linear(in_features=24, out_features=128, bias=True)\n",
" (fc2): Linear(in_features=128, out_features=128, bias=True)\n",
" (fc3): Linear(in_features=128, out_features=2, bias=True)\n",
" (bn2): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n",
")\n",
"Critic net - local: Critic(\n",
" (bn1): BatchNorm1d(24, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n",
" (fcs1): Linear(in_features=24, out_features=128, bias=True)\n",
" (fc2): Linear(in_features=132, out_features=128, bias=True)\n",
" (fc3): Linear(in_features=128, out_features=1, bias=True)\n",
" (bn2): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n",
")\n",
"Critic net - target: Critic(\n",
" (bn1): BatchNorm1d(24, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n",
" (fcs1): Linear(in_features=24, out_features=128, bias=True)\n",
" (fc2): Linear(in_features=132, out_features=128, bias=True)\n",
" (fc3): Linear(in_features=128, out_features=1, bias=True)\n",
" (bn2): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n",
")\n",
"Episode 1000\tAverage Score: 2.19"
]
}
],
"source": [
"import torch.optim as optim\n",
"\n",
"agent = Agent(state_size=24, action_size=2, random_seed=0)\n",
"\n",
"agent.actor_local.load_state_dict(torch.load('checkpoint_actor.pth'))\n",
"agent.critic_local.load_state_dict(torch.load('checkpoint_critic.pth'))\n",
"\n",
"n_episodes=1000\n",
"NUM_AGENTS=2\n",
"score = np.zeros(NUM_AGENTS)\n",
"\n",
"env_info = env.reset(train_mode=False)[brain_name] # reset env\n",
"states = env_info.vector_observations # get current states\n",
"\n",
"for i_eps in range(1,n_episodes+1):\n",
" \n",
" actions_agent_0 = agent.act(states, noise_factor=0) # let the agent select actions\n",
" actions_agent_1 = np.flip(actions_agent_0, 0) # action of second agent/player\n",
" env_info = env.step(actions_agent_0)[brain_name] # send actions of both agents to env\n",
" rewards = env_info.rewards # get the rewards from env\n",
" next_state_agent_0 = env_info.vector_observations # get next states from env\n",
" next_state_agent_1 = np.flip(next_state_agent_0, 0) # get the resulting states for the second agent/player\n",
" dones = env_info.local_done # check if episode is done\n",
"\n",
" agent.step(states, actions_agent_0, actions_agent_1, rewards, next_state_agent_0, next_state_agent_1, dones) # perform step of agent\n",
"\n",
" # update statistical variables\n",
" states = next_state_agent_0\n",
" score += rewards\n",
" #if np.any(dones):\n",
" # break \n",
"\n",
"print('\\rEpisode {}\\tAverage Score: {:.2f}'.format(i_eps, np.mean(score)), end=\"\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}