Markov decision processes

Neural networks, classification and clustering algorithms are considered intelligent insofar as they fall under the umbrella of Artificial Intelligence. But can they make novel, independent decisions? Can they plan how to carry those decisions out and improve that plan over time?

When I ask these questions I don't want to focus on one specific aspect like "improving the algorithm". I could manually bolt a genetic algorithm onto the system and, voilà, some characteristic improves over time. Instead I want the system itself to decide whether it needs an improvement, or a change, depending on its goals.

In other words, if the environment changes, the system should be able to use its prior knowledge to learn and adapt to the new conditions. In terms of the current AI paradigm: it should not need a new training and test set and a training process that runs for hours or days on many GPUs. Is that possible?

Relevant keywords:

  • incremental learning
  • generative neural networks

Papers

Code

Possibly useful for the implementation

My implementation of the MDP didn't converge. In reinforcement learning the reward function must be designed for the task, and I had put in the wrong values: 10 points for reaching the goal and -20 as a penalty for every other state. Changing these to 0 and -1 respectively gives 100% convergence. Here's the source of enlightenment:

When the problem solver reaches a goal state, we can provide a fixed reward (e.g., zero) and terminate the search (i.e., the goal states are absorbing states). With this reward function, the cumulative reward of a policy is equal to the negative of the cost of solving the problem using that policy. Hence, the optimal policy will be the policy that solves the problem most efficiently.
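To see why the 0 / -1 scheme behaves so well, here is a minimal value-iteration sketch on a small grid world. The grid size, goal position and discount factor are my own illustrative assumptions, not the settings from my original experiment; the point is only that with a zero-reward absorbing goal and a -1 step cost the optimal values end up being minus the shortest-path distances, exactly as the quote describes.

```python
# Minimal value-iteration sketch on a 4x4 grid world (layout, goal position
# and discount factor are assumptions for illustration only).
# Reward: 0 at the absorbing goal state, -1 for every other step, so the
# cumulative reward of a policy is minus the number of steps it takes.

import numpy as np

SIZE = 4                                         # 4x4 grid, states are (row, col)
GOAL = (3, 3)                                    # absorbing goal state (assumed)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]     # up, down, left, right
GAMMA = 0.95                                     # discount factor (assumed)

def step(state, action):
    """Deterministic transition: move if inside the grid, else stay put."""
    if state == GOAL:                            # goal is absorbing
        return state, 0.0
    r, c = state[0] + action[0], state[1] + action[1]
    nxt = (r, c) if 0 <= r < SIZE and 0 <= c < SIZE else state
    return nxt, -1.0                             # every non-goal step costs 1

V = np.zeros((SIZE, SIZE))
for _ in range(100):                             # value-iteration sweeps
    V_new = np.copy(V)
    for r in range(SIZE):
        for c in range(SIZE):
            state = (r, c)
            if state == GOAL:
                continue                         # V(goal) stays at 0
            # Bellman optimality backup over the four actions
            V_new[r, c] = max(rew + GAMMA * V[nxt]
                              for nxt, rew in (step(state, a) for a in ACTIONS))
    if np.max(np.abs(V_new - V)) < 1e-6:         # stop once values settle
        break
    V = V_new

print(np.round(V, 2))   # values approximate minus the distance to the goal
```

With this reward function the printed value map is just the negated path length to the goal from each cell, so the greedy policy with respect to it is the most efficient one.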

Examples

Value Iteration Networks

There is this repo here

Exploring world with Value Iteration Network (VIN): one of the basic applications of the Value Iteration Network, which learns how to find an optimal path between two points in an environment with obstacles.

From the paper "Value Iteration Networks" (Aviv Tamar, Yi Wu, Garrett Thomas, Sergey Levine, and Pieter Abbeel, 2017):

We introduce the value iteration network (VIN): a fully differentiable neural network with a ‘planning module’ embedded within. VINs can learn to plan, and are suitable for predicting outcomes that involve planning-based reasoning, such as policies for reinforcement learning.
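The "planning module" is essentially value iteration unrolled as a small convolutional network: a reward map is convolved into one Q channel per action, a max over the channels gives the value map, and the result is fed back in for K iterations. Below is a minimal PyTorch-style sketch of that idea; the channel counts, kernel size and number of iterations are my own illustrative choices, not the exact configuration from the paper.

```python
# Sketch of the VIN "planning module": value iteration unrolled as a
# convolutional network. Channel counts, kernel size and K are illustrative
# assumptions, not the paper's exact architecture.

import torch
import torch.nn as nn

class ValueIterationModule(nn.Module):
    def __init__(self, n_actions=8, k_iterations=10):
        super().__init__()
        self.k = k_iterations
        # Convolution mapping [reward map, value map] -> one Q channel per
        # action; the learned kernel plays the role of the transition model.
        self.q_conv = nn.Conv2d(2, n_actions, kernel_size=3, padding=1, bias=False)

    def forward(self, reward_map):
        # reward_map: (batch, 1, H, W), e.g. produced by a small CNN
        # from the observed grid-world image.
        value_map = torch.zeros_like(reward_map)
        for _ in range(self.k):
            q = self.q_conv(torch.cat([reward_map, value_map], dim=1))
            value_map, _ = q.max(dim=1, keepdim=True)   # max over actions
        return value_map

# Usage: "plan" on a random 16x16 reward image (purely synthetic input).
vi = ValueIterationModule()
values = vi(torch.randn(1, 1, 16, 16))
print(values.shape)   # torch.Size([1, 1, 16, 16])
```

Because every operation is a convolution or a max, the whole planning loop stays differentiable and can be trained end-to-end together with the rest of the policy network.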

VIN figure 1

Typical example of use

Here's a bold claim that accuses typical RL policies of merely maximizing individual actions instead of following a smart strategy:

[Typical NNs] are inherently reactive, and in particular, lack explicit planning computation. The success of reactive policies in sequential problems is due to the learning algorithm, which essentially trains a reactive policy to select actions that have good long-term consequences in its training domain.

So I skipped the simple ANN implementations and tried VIN. I ran into git problems here, but found a fork with a fix. I also found that VIN is implemented in PyTorch and TensorFlow, which means I don't need to use the neupy library.

Questions

From the MDP definition it is not clear to me whether an agent can carry along information obtained in one of the states, if we view a state simply as a position on a two-dimensional grid.

Because if we add that information to the state, then every state must include this extra dimension. In other words, the agent effectively travels into another world where the grid is the same, but it has to learn the transitions between the worlds from scratch.

Here's an illustration of the two connected worlds that I found in the article: one without mail in the mailbox (M = false) and another after the postman has delivered mail (M = true); otherwise the loop is the same.

Illustration from Decision-Theoretic Planning
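One way to make the question concrete: augmenting the grid position with the extra bit turns the state into a tuple (row, col, M), which doubles the state space and gives exactly the "two connected worlds" of the figure. Here is a toy sketch of that product state space; the grid size and the cell where the mail appears are assumptions for illustration only.

```python
# Toy sketch of augmenting a grid state with one boolean flag, turning the
# state into (row, col, has_mail). Grid size and the mailbox cell are
# assumptions for illustration only.

from itertools import product

SIZE = 3
MAILBOX = (0, 0)          # cell where the mail appears (assumed)
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

# The augmented state space is the product of grid positions and the flag:
# the "two connected worlds" from the illustration.
states = [(r, c, m) for r, c, m in product(range(SIZE), range(SIZE), [False, True])]
print(len(states))        # 9 positions x 2 flag values = 18 states

def transition(state, action):
    """Deterministic move; the flag flips to True when the agent visits MAILBOX."""
    r, c, has_mail = state
    dr, dc = ACTIONS[action]
    nr, nc = r + dr, c + dc
    if not (0 <= nr < SIZE and 0 <= nc < SIZE):
        nr, nc = r, c                        # bump into the wall, stay put
    if (nr, nc) == MAILBOX:
        has_mail = True                      # crossing between the two "worlds"
    return (nr, nc, has_mail)

print(transition((0, 1, False), "left"))     # -> (0, 0, True)
```

In this formulation the within-world transitions are copies of each other; whether a learning algorithm can exploit that shared structure instead of relearning it is exactly the question raised above.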
