Project 1: Navigation

Introduction

For this project, you will train an agent to navigate (and collect bananas!) in a large, square world.

A reward of +1 is provided for collecting a yellow banana, and a reward of -1 is provided for collecting a blue banana. Thus, the goal of your agent is to collect as many yellow bananas as possible while avoiding blue bananas.

The state space has 37 dimensions and contains the agent's velocity, along with ray-based perception of objects around the agent's forward direction. Given this information, the agent has to learn how to best select actions. Four discrete actions are available, corresponding to:

  • 0 - move forward.
  • 1 - move backward.
  • 2 - turn left.
  • 3 - turn right.

The task is episodic, and in order to solve the environment, your agent must get an average score of +13 over 100 consecutive episodes.

Getting Started

  1. Download the environment from one of the links below. You need only select the environment that matches your operating system:

    (For Windows users) Check out this link if you need help determining whether your computer is running a 32-bit or 64-bit version of the Windows operating system.

    (For AWS) If you'd like to train the agent on AWS (and have not enabled a virtual screen), then please use this link to obtain the environment.

  2. Place the file in the DRLND GitHub repository, in the p1_navigation/ folder, and unzip (or decompress) the file.

(Optional) Challenge: Learning from Pixels

After you have successfully completed the project, if you're looking for an additional challenge, you have come to the right place! In the project, your agent learned from information such as its velocity, along with ray-based perception of objects around its forward direction. A more challenging task would be to learn directly from pixels!

To solve this harder task, you'll need to download a new Unity environment. This environment is almost identical to the project environment, where the only difference is that the state is an 84 x 84 RGB image, corresponding to the agent's first-person view. (Note: Udacity students should not submit a project with this new environment.)

You need only select the environment that matches your operating system:

Then, place the file in the p1_navigation/ folder in the DRLND GitHub repository, and unzip (or decompress) the file. Next, open Navigation_Pixels.ipynb and follow the instructions to learn how to use the Python API to control the agent.

(For AWS) If you'd like to train the agent on AWS, you must follow the instructions to set up X Server, and then download the environment for the Linux operating system above.

Dependencies

If you would like to set up the environment on your own machine, check the dependencies.

Approach for the Agent

The goal of the project is to train an RL agent to navigate autonomously in a Unity environment and collect bananas. The agent receives a reward of +1 for collecting a yellow banana and a reward of -1 for collecting a blue banana.

Follow the instructions in Navigation.ipynb to get started with training your own agent or applying and testing my pretrained agent!

The approach for the agent is as follows:

1. Identify the state and action space

The state space has 37 dimensions and contains the agent's velocity, along with ray-based perception of objects around the agent's forward direction. Given this information, the agent has to learn how to best select actions. Four discrete actions are available, corresponding to:

  • 0 - move forward.
  • 1 - move backward.
  • 2 - turn left.
  • 3 - turn right.
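As a minimal sketch (assuming the unityagents package and a local copy of the Banana environment file, whose name depends on your operating system), these sizes can be read directly from the environment:

from unityagents import UnityEnvironment

env = UnityEnvironment(file_name="Banana.app")     # file name depends on your OS
brain_name = env.brain_names[0]                    # the default brain controls the agent
brain = env.brains[brain_name]

env_info = env.reset(train_mode=True)[brain_name]  # reset and read the first observation
state_size = len(env_info.vector_observations[0])  # 37
action_size = brain.vector_action_space_size       # 4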
2. Apply an agent with a random policy (given by Udacity)
import numpy as np                                 # for random action selection

# env, brain_name and action_size are set up as in the snippet above
env_info = env.reset(train_mode=True)[brain_name]  # reset the environment
state = env_info.vector_observations[0]            # get the current state
score = 0                                          # initialize the score
while True:
    action = np.random.randint(action_size)        # select an action at random
    env_info = env.step(action)[brain_name]        # send the action to the environment
    next_state = env_info.vector_observations[0]   # get the next state
    reward = env_info.rewards[0]                   # get the reward
    done = env_info.local_done[0]                  # see if episode has finished
    score += reward                                # update the score
    state = next_state                             # roll over the state to next time step
    if done:                                       # exit loop if episode finished
        break
print("Score: {}".format(score))

The task is episodic, and in order to solve the environment, the agent must get an average score of +13 over 100 consecutive episodes.
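A sketch of how this criterion is typically checked inside the training loop, assuming a rolling window over the last 100 episode scores:

from collections import deque
import numpy as np

scores_window = deque(maxlen=100)    # scores of the last 100 episodes

def check_solved(scores_window, i_episode, target=13.0):
    # solved once the 100-episode average reaches the target; by the course's
    # convention the solve episode is reported as i_episode - 100
    if len(scores_window) == 100 and np.mean(scores_window) >= target:
        print("Environment solved in {} episodes!\tAverage Score: {:.2f}".format(
            i_episode - 100, np.mean(scores_window)))
        return True
    return False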

3. Implement a Deep Q-Learning agent

The agent discovers the environment through interaction and records its observations; through this interaction it learns a policy by trial and error. As presented in the lessons and the corresponding exercises, a Deep Q-Network (DQN) is applied. The DQN is used in this case to approximate the Q-function.
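In code, the approximated Q-function is trained toward the one-step TD target. A minimal sketch of such a learning step, assuming PyTorch and hypothetical names q_local (the trained network) and q_target (the slowly updated copy), with the batch given as torch tensors:

import torch.nn.functional as F

def learn_step(q_local, q_target, optimizer, batch, gamma=0.99):
    states, actions, rewards, next_states, dones = batch
    # TD target: r + gamma * max_a' Q_target(s', a'), zeroed when the episode ends
    q_next = q_target(next_states).detach().max(1)[0].unsqueeze(1)
    targets = rewards + gamma * q_next * (1 - dones)
    # current estimates Q_local(s, a) for the actions actually taken
    expected = q_local(states).gather(1, actions)
    loss = F.mse_loss(expected, targets)           # mean squared TD error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()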

The neural network used (model.py) consists of 3 fully connected layers of sizes 64, 64, and 4. The latter value 4 is the action size of the environment, and 64 is the number of nodes per hidden layer.
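A sketch of such a network in PyTorch (layer names fc1-fc3 are assumptions; the actual definition lives in model.py):

import torch.nn as nn
import torch.nn.functional as F

class QNetwork(nn.Module):
    """Maps a 37-dimensional state to Q-values for the 4 discrete actions."""
    def __init__(self, state_size=37, action_size=4, fc_units=64):
        super().__init__()
        self.fc1 = nn.Linear(state_size, fc_units)
        self.fc2 = nn.Linear(fc_units, fc_units)
        self.fc3 = nn.Linear(fc_units, action_size)

    def forward(self, state):
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        return self.fc3(x)                         # raw Q-values, one per action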

Furthermore, the agent uses experience replay (dqn_agent.py), which allows it to learn from past experiences. After each interaction with the environment, an experience consisting of the state, action, reward, and next state is stored in a buffer. During learning, the agent randomly samples from this buffer and learns from the sampled data.
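A minimal sketch of such a buffer, assuming Python's deque for storage and additionally keeping the done flag, as the standard course implementation does (the project's actual implementation lives in dqn_agent.py):

import random
from collections import deque, namedtuple

Experience = namedtuple("Experience", ["state", "action", "reward", "next_state", "done"])

class ReplayBuffer:
    """Fixed-size buffer storing interaction tuples for uniform random sampling."""
    def __init__(self, buffer_size=int(1e5), batch_size=64):
        self.memory = deque(maxlen=buffer_size)    # oldest experiences are discarded first
        self.batch_size = batch_size

    def add(self, state, action, reward, next_state, done):
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self):
        return random.sample(self.memory, k=self.batch_size)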

4. Train the DQN agent and optimize parameters

The following hyperparameters in dqn_agent.py were kept the same:

BUFFER_SIZE = int(1e5)  # replay buffer size
BATCH_SIZE = 64         # minibatch size
GAMMA = 0.99            # discount factor
TAU = 1e-3              # for soft update of target parameters
LR = 5e-4               # learning rate 
UPDATE_EVERY = 4        # how often to update the network
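TAU controls the soft update of the target network's parameters toward the local network's. A sketch of that update, following the convention theta_target <- tau*theta_local + (1 - tau)*theta_target:

def soft_update(local_model, target_model, tau=1e-3):
    # blend a small fraction of the local weights into the target weights
    for target_param, local_param in zip(target_model.parameters(),
                                         local_model.parameters()):
        target_param.data.copy_(tau * local_param.data
                                + (1.0 - tau) * target_param.data)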

During training, the following hyperparameters were tuned slightly to reach better results:
a) n_episodes=500, eps_decay=0.995, eps_end=0.01
Episode 500 Average Score: 12.80

b) n_episodes=500, eps_decay=0.975, eps_end=0.02
Environment solved in 199 episodes! Average Score: 13.04

c) n_episodes=500, eps_decay=0.98, eps_end=0.02
Environment solved in 235 episodes! Average Score: 13.06
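These three settings enter through the epsilon-greedy exploration schedule. A sketch of how eps_decay and eps_end are typically applied once per episode (shown with the values from variant b); the episode body itself is omitted:

eps = 1.0                             # eps_start: act fully at random at first
eps_end, eps_decay = 0.02, 0.975      # values from variant b)
for i_episode in range(1, 501):       # n_episodes = 500
    # ... run one episode, selecting epsilon-greedy actions with the current eps ...
    eps = max(eps_end, eps_decay * eps)   # decay exploration, floored at eps_end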

Training the agent with the parameters listed in variant b) gave the best results in solving the problem, as shown in the training log and the plot of rewards per episode below:

Episode 100	Average Score: 4.19
Episode 200	Average Score: 9.17
Episode 299	Average Score: 13.04
Environment solved in 199 episodes!	Average Score: 13.04

The plotted mean values show that the agent solves the problem, reaching an average score of +13 at episode 299 of the training (reported as solved in 199 episodes, since the average is taken over the preceding 100 episodes). [plot of rewards per episode]

Further Improvements

Further improvements could be achieved by:

  • a deeper neural network in the Q-Learning part
  • further algorithms such as Double Deep Q-Network (DDQN), sketched below
  • or Dueling Deep Q-Network (Dueling DQN)
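For instance, Double DQN changes only the computation of the TD target: the local network selects the greedy action while the target network evaluates it, which reduces overestimation bias. A sketch, mirroring the hypothetical learning step above:

def double_dqn_targets(q_local, q_target, rewards, next_states, dones, gamma=0.99):
    # action selection by the local network ...
    best_actions = q_local(next_states).detach().argmax(1, keepdim=True)
    # ... but action evaluation by the target network
    q_next = q_target(next_states).detach().gather(1, best_actions)
    return rewards + gamma * q_next * (1 - dones)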