Reinforcement Learning


Personal research


Questions

  • Does anyone know if the reward function in reinforcement learning has ever been derived from logical reasoning instead of a priori given values?

Program

I read the DQN paper and want to implement something better than the existing Python code.

Let's write a reinforcement learning program in Rust. To begin with, we will solve the Frozen Lake environment, using a grid world from the rsrl crate:

cargo new qu-robot
cd qu-robot
cargo add rsrl
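
Before wiring anything into rsrl, here is a minimal self-contained sketch of what we are aiming for: a 4x4 Frozen-Lake-style grid and a tabular Q-learning loop. All the names in it (MAP, step, the reward values) are my own placeholders and assumptions, not the rsrl API.

const W: usize = 4;
const H: usize = 4;
// S = start, F = frozen, H = hole, G = goal
const MAP: [&str; H] = ["SFFF", "FHFH", "FFFH", "HFFG"];
// up, down, left, right as (dx, dy)
const MOVES: [(i32, i32); 4] = [(0, -1), (0, 1), (-1, 0), (1, 0)];

// Returns (next state, reward, episode finished).
fn step(state: usize, action: usize) -> (usize, f64, bool) {
    let (x, y) = ((state % W) as i32, (state / W) as i32);
    let (dx, dy) = MOVES[action];
    // Walking off the grid is clamped: the agent silently stays at the border.
    let nx = (x + dx).clamp(0, W as i32 - 1);
    let ny = (y + dy).clamp(0, H as i32 - 1);
    let next = ny as usize * W + nx as usize;
    match MAP[ny as usize].as_bytes()[nx as usize] {
        b'G' => (next, 1.0, true),  // reaching the goal ends the episode with reward 1
        b'H' => (next, 0.0, true),  // falling into a hole ends it with reward 0
        _ => (next, 0.0, false),
    }
}

fn main() {
    let (alpha, gamma, epsilon) = (0.1, 0.99, 0.1_f64);
    let mut q = [[0.0_f64; 4]; W * H];
    // Tiny xorshift generator so the sketch does not need the rand crate.
    let mut rng: u64 = 0x2545F4914F6CDD1D;
    let mut next_rand = move || {
        rng ^= rng << 13;
        rng ^= rng >> 7;
        rng ^= rng << 17;
        rng
    };

    for _ in 0..20_000 {
        let mut s = 0; // start in the top-left corner
        loop {
            // Epsilon-greedy action selection.
            let a = if ((next_rand() % 1000) as f64) / 1000.0 < epsilon {
                (next_rand() % 4) as usize
            } else {
                (0..4)
                    .max_by(|&i, &j| q[s][i].partial_cmp(&q[s][j]).unwrap())
                    .unwrap()
            };
            let (s2, r, done) = step(s, a);
            let best_next = if done { 0.0 } else { q[s2].iter().cloned().fold(f64::MIN, f64::max) };
            // Q-learning update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
            q[s][a] += alpha * (r + gamma * best_next - q[s][a]);
            s = s2;
            if done { break; }
        }
    }
    println!("Q-values at the start cell: {:?}", q[0]);
}

Note that the clamp in step reproduces exactly the stay-in-place behavior discussed next: bumping into the edge of the grid is accepted as an action but goes nowhere.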

Walls in the grid environment are an interesting problem. In the rsrl crate it is implemented so that moving into a wall is allowed as an action, but no movement actually occurs and the agent just stays in the same position. That is not good, because staying in place should be an intentional action rather than a loophole in the implementation. Staying in the same spot is an equilibrium, so we can call it a safe strategy.
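
To make that concrete, here is a sketch with my own types (not rsrl's code) contrasting the silent no-op with an explicit Stay action:

#[derive(Clone, Copy)]
enum Action { Up, Down, Left, Right, Stay }

fn offset(pos: (i32, i32), a: Action) -> (i32, i32) {
    match a {
        Action::Up => (pos.0, pos.1 - 1),
        Action::Down => (pos.0, pos.1 + 1),
        Action::Left => (pos.0 - 1, pos.1),
        Action::Right => (pos.0 + 1, pos.1),
        Action::Stay => pos,
    }
}

// The behavior described above: a move into a wall is accepted as an action,
// but the position silently stays the same.
fn step_with_loophole(pos: (i32, i32), a: Action, is_wall: impl Fn((i32, i32)) -> bool) -> (i32, i32) {
    let next = offset(pos, a);
    if is_wall(next) { pos } else { next }
}

// The alternative: staying in place is its own intentional action, not a side
// effect of bumping into a wall.
fn step_with_explicit_stay(pos: (i32, i32), a: Action) -> (i32, i32) {
    match a {
        Action::Stay => pos,
        _ => offset(pos, a), // the caller is expected not to pick wall-bound moves
    }
}

fn main() {
    let wall = |p: (i32, i32)| p.0 < 0 || p.1 < 0 || p.0 > 3 || p.1 > 3;
    // Bumping into the left wall from (0, 0) "succeeds" but goes nowhere.
    assert_eq!(step_with_loophole((0, 0), Action::Left, wall), (0, 0));
    // Staying put as a deliberate choice.
    assert_eq!(step_with_explicit_stay((0, 0), Action::Stay), (0, 0));
    println!("ok");
}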

What can we do to fix it? We can remove movements that go into a wall from the list of possible actions. But then the sum of probabilities for the remaining actions stops being equal to 1. That could be solved by normalization, but I have a more important argument against it. I think the agent should understand something about the environment and know what walls are, since they can be the limits of the environment as well as any obstacle inside it. And if the agent finds a way to avoid obstacles, it can apply the same technique to the limits of the environment and leave the box...
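
For reference, the masking idea would look roughly like this (a sketch with hypothetical names, not anything from rsrl): drop the wall-bound actions and rescale what is left so the probabilities sum to 1 again.

// Drop the actions that lead into a wall, then rescale the rest so the
// probabilities sum to 1 again.
fn mask_and_renormalize(probs: &[f64], leads_into_wall: &[bool]) -> Vec<f64> {
    let masked: Vec<f64> = probs
        .iter()
        .zip(leads_into_wall)
        .map(|(&p, &blocked)| if blocked { 0.0 } else { p })
        .collect();
    let total: f64 = masked.iter().sum();
    // If every action is blocked there is nothing to rescale.
    if total > 0.0 {
        masked.iter().map(|p| p / total).collect()
    } else {
        masked
    }
}

fn main() {
    // up, down, left, right: suppose "up" would walk into a wall.
    let probs = [0.4, 0.2, 0.2, 0.2];
    let blocked = [true, false, false, false];
    println!("{:?}", mask_and_renormalize(&probs, &blocked)); // ~[0.0, 0.333, 0.333, 0.333]
}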

Then, if we add another state 'W' everywhere around our current grid, we only emphasize the problem we had originally: every decision takes into consideration only the current state. If that state was near a wall, then we learned that that direction is bad. Instead of just observing what is nearby, our agent takes actions like a blind hedgehog in the dark. So we need to combine all the cells around the agent into one state.
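
As a sketch, with my own types rather than anything from rsrl, the observation could carry what surrounds the agent instead of only its position:

#[derive(Clone, Copy, PartialEq, Eq, Hash)]
enum Cell { Frozen, Hole, Goal, Wall }

// The old observation: only where the agent stands.
type PositionState = (usize, usize);

// The combined observation: what surrounds the agent. "There is a wall to the
// left" becomes part of the state itself instead of something learned
// separately for every position, and it can serve as a Q-table key.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct LocalState {
    up: Cell,
    down: Cell,
    left: Cell,
    right: Cell,
    here: Cell,
}

fn main() {
    let _pos: PositionState = (0, 0);
    let s = LocalState { up: Cell::Wall, down: Cell::Frozen, left: Cell::Wall, right: Cell::Hole, here: Cell::Frozen };
    // Any two positions with identical surroundings now map to the same state.
    println!("{}", s == s);
}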

But this exponentially increases our set of possible states and the dimensions of the transition matrix. So let's only include states that have reward values. That's not every state, right?

We skip the diagonal values. Thus we have 5 states that together create one meta-state. Having only 3 meaningful states, we get 5^3 = 125 meta-states.
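
One way to pack such a neighborhood into a single index is plain base-N encoding. This is a sketch with hypothetical names and an assumed 3-value cell encoding; in general, N cells that each take one of V values give V^N meta-states.

// Encode a list of cell values into one meta-state index, treating the list as
// a number written in base `num_values`.
fn meta_state(cells: &[usize], num_values: usize) -> usize {
    cells.iter().fold(0, |acc, &v| {
        debug_assert!(v < num_values);
        acc * num_values + v
    })
}

fn main() {
    // Hypothetical encoding: 0 = frozen, 1 = hole/wall, 2 = goal.
    let neighborhood = [0, 1, 0, 0, 2]; // up, down, left, right, current cell
    println!("meta-state index = {}", meta_state(&neighborhood, 3));
}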
