Reinforcement Learning

Reinforcement Learning (RL), a subset of Machine Learning, has made its mark on today’s technological world by enabling automated decision-making.

Here’s a list of things we’ll cover to get a better understanding of RL:

What is RL?
The need for RL.
Comparison with other machine learning techniques.
Applications.
Benefits and challenges of RL.
Basic problem formulation.
Painting the future with RL.

What is Reinforcement Learning?

Reinforcement Learning is a sub-branch of machine learning that trains a model to find an optimal solution to a problem by making a
sequence of decisions on its own.

It is also defined as a type of machine learning technique that enables an agent to learn in an interactive environment by trial and error
using feedback from its actions and experiences.

Need for Reinforcement Learning

A major drawback of conventional machine learning is that a tremendous amount of data is needed to train models.

The more complex the model, the more data is required.
Learning from a small set of examples also fails to cover the vast range of solutions that may work for a particular problem.

RL approaches these problems differently. In RL, the model is placed in a controlled environment that is modeled after the problem
to be solved, and it learns by interacting with that environment instead of from a pre-collected dataset.

Understanding the terms in Reinforcement Learning

Consider a scenario with a dog, a ball, and a cookie or biscuit for the dog. When the dog fetches the ball and does the job
properly, it gets a cookie. In this scenario the dog has to be trained and a solution has to be found. The formal terms of RL map onto this scenario as follows.

The agent is the model that is trained by RL. Here, the dog is the agent.

The environment is the setting in which the agent acts and is trained. Here, the place where the dog is located is the environment.

Actions are all the possible steps the agent can take. Here, each step the dog takes to fetch the ball is an action.

The state is the current position or condition of the agent, as reported by the environment. Here, the dog’s current position and its behavior while fetching the ball make up the state.

If the dog fetches the ball properly within that predefined environment, it gets a biscuit. This is the reward, which is given for an appropriate action.

No external data is provided here for training the model. The aim is to understand the problem statement, learn through trial and error, and arrive at a model that performs better over time. This is RL.
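The terms above can be sketched as a small interaction loop. This is only an illustration under stated assumptions: the FetchEnv class, its one-dimensional world, and the reward of 1.0 are hypothetical and do not come from any particular library. The agent here is untrained, so it simply picks random actions.

```python
import random

class FetchEnv:
    """Hypothetical environment: the dog starts at position 0, the ball lies at `goal`."""

    def __init__(self, goal=3):
        self.goal = goal
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position  # the state

    def step(self, action):
        # action is -1 (step away from the ball) or +1 (step toward the ball)
        self.position = max(0, self.position + action)
        done = self.position == self.goal
        reward = 1.0 if done else 0.0  # the "biscuit" for fetching the ball
        return self.position, reward, done

def run_episode(env, rng, max_steps=100):
    """Let an untrained agent act at random and return the total reward it collects."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = rng.choice([-1, 1])            # the agent picks an action
        state, reward, done = env.step(action)  # the environment returns state and reward
        total_reward += reward
        if done:
            break
    return total_reward

print(run_episode(FetchEnv(), random.Random(0)))
```

A trained agent would replace the random choice with a policy that uses the state to pick the action.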

Consider an example. Suppose the dog needs to be house trained. For this problem, RL is used to develop a model. Here, the dog is the
agent and the house is the environment.

The dog will be trained, and if it performs well it will be rewarded with a biscuit or cookie. The dog will follow a policy to maximize its reward, so it will follow every command and might even learn new actions, such as tilting its head or shaking hands, all by itself.

The dog will also run around, play, and explore its environment. This quality of a model is called Exploration.

The tendency of the dog to repeat the actions it already knows will earn rewards is called Exploitation. While trying to maximize rewards, the dog may also do something it
was not trained to do, such as standing on the couch. The dog has explored a new solution, but it is not an expected action, so the model is not performing as intended.

There is always a trade-off between exploration and exploitation: exploration may earn lower rewards in the short term, but without it the agent may never discover better actions.
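One common way to balance the two is an epsilon-greedy rule, sketched below. The article does not name a specific rule, so this is an assumption for illustration: the function name epsilon_greedy, the three dog actions, and their reward estimates are all hypothetical.

```python
import random

def epsilon_greedy(q_values, epsilon, rng):
    """Return an action index: random with probability epsilon, otherwise the best known."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore: try any action
    # exploit: pick the action with the highest estimated reward
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Hypothetical reward estimates for three dog actions: sit, fetch, stand on the couch
q_values = [0.2, 0.9, 0.1]
rng = random.Random(42)
choices = [epsilon_greedy(q_values, epsilon=0.1, rng=rng) for _ in range(1000)]
print(choices.count(1) / len(choices))  # share of "fetch" choices
```

With a small epsilon the dog mostly fetches, but it still tries the other actions now and then, which is how it could notice if one of them turned out to be better.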

Comparison of RL with other machine learning techniques

The following are the types of ML techniques:

Supervised Learning – Labeled data is the input, and it is task-driven.
Unsupervised Learning – Unlabeled data is the input, and it is data-driven.
Reinforcement Learning – An agent is trained in an environment and learns from its mistakes to find a solution.

Supervised Learning	Unsupervised Learning	Reinforcement Learning
Data provided is labeled, with output values specified	Data provided is unlabeled, with no output values specified; the model makes its own predictions	The machine learns from its environment using rewards and errors
Used to solve regression and classification problems	Used to solve association and clustering problems	Used to solve reward-based problems
Labeled data is used	Unlabeled data is used	No predefined data is used
External supervision	No supervision	No supervision; feedback comes as rewards
Solves problems by mapping labeled input to known output	Solves problems by finding patterns in the data	Follows a trial-and-error problem-solving approach

Applications of RL

Control – Decisions are taken based on how the task is performing. For example, RL can be used to adapt a factory process or a telecommunication system.
Chemistry – Chemical reactions are optimized using RL, for example when preparing medicines.
Business – Earning money requires business planning, which can be supported using RL.
Manufacturing – Autonomous robots can act as agents that pick up goods, including in activities that people cannot perform. Here, the robots are programmed using RL. The same approach can also be used in healthcare.
Finance sector – In stock market prediction, evaluation and optimization are done using RL.
Game playing – RL can be used to determine the next move in a game depending on various factors.

Benefits of RL

RL focuses on the problem as a whole; the task is not sub-divided. It senses data from the environment and trains the agent to maximize rewards. Other machine learning approaches, by contrast, often divide a task into subtasks.
It is capable of giving up short-term rewards to gain long-term rewards, and it doesn’t need a separate data collection step.
Agents learn directly from the environment, so no separately prepared training dataset is required. This reduces the burden of preparing data.
RL can work in dynamic, uncertain environments.
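The idea of trading short-term rewards for long-term rewards is usually expressed as a discounted return, where a reward that arrives t steps later is weighted by gamma to the power t. The sketch below assumes a discount factor gamma of 0.9 and a made-up reward sequence; both are illustrative values, not taken from the article.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum the rewards, weighting the reward at step t by gamma**t."""
    total = 0.0
    # Work backwards so each earlier step discounts everything that follows it
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total

# A small cost now, followed by a larger reward two steps later
print(discounted_return([-1.0, 0.0, 10.0], gamma=0.9))
```

Even though the first reward is negative, the return for the whole sequence is positive, so an agent that maximizes the return would accept the early cost.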

Challenges of RL

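The agent needs a lot of processing time and extensive experience.
Delayed rewards. Rewards may arrive long after the actions that earned them, so the agent must work out which short-term choices belong to the optimal policy for long-term rewards.
Lack of interpretability. In high-risk environments, an observer may not be able to understand why the agent acts as it does.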

Basic problem formulation

Formulate problem – Define the task for the agent to learn, including how the agent interacts with the environment and any primary and secondary goals the agent must achieve.
Create Environment – Define the environment in which the agent operates, including the interface between agent and environment.
Define Reward – Specify the reward that the agent uses to measure its performance against the task goals, and how the reward is calculated from the environment. The reward may be positive or negative depending on whether the task is achieved.
Create Agent – Create the agent, which includes defining a policy representation and configuring the agent’s learning algorithm.
Train Agent – Train the agent’s policy representation using the defined environment, reward, and learning algorithm.
Validate Agent – Evaluate the performance of the trained agent by simulating the agent and the environment together.
Deploy Policy – Deploy the trained policy representation, for example in simulation or on specific hardware.

Example:

Step 1 – Formulate the problem

Consider how babies or children learn to walk, and create a robot (the agent) that learns in a similar way.

Step 2 – Create Environment

Consider two environments: one with a house and a couch, and another with a house, a couch, and an obstruction (a table).
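The two environments could be described in code roughly as follows. This is a minimal sketch, assuming the house is a small grid of cells; the HouseLayout class, the grid size, and the positions of the couch and the table are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HouseLayout:
    """Hypothetical grid layout of the house; cells are (row, col) tuples."""
    rows: int
    cols: int
    start: tuple
    couch: tuple
    obstacles: frozenset = frozenset()

    def is_free(self, cell):
        """A cell can be walked on if it is inside the house and not blocked."""
        row, col = cell
        inside = 0 <= row < self.rows and 0 <= col < self.cols
        return inside and cell not in self.obstacles

# Environment 1: house and couch only
open_house = HouseLayout(rows=3, cols=4, start=(0, 0), couch=(2, 3))

# Environment 2: the same house with a table in the way
house_with_table = HouseLayout(rows=3, cols=4, start=(0, 0), couch=(2, 3),
                               obstacles=frozenset({(1, 1)}))

print(open_house.is_free((1, 1)), house_with_table.is_free((1, 1)))
```

Keeping the layout separate from the agent makes it easy to train on the open house first and then test on the house with the table.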

Step 3 – Define Reward

There are two kinds of reward: positive and negative. If the baby starts walking and successfully reaches the couch, the reward is positive. Otherwise, if the baby falls and gets hurt, the reward is negative.
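A reward rule like this can be written as a small function. The sketch below is an assumption for illustration: the name walking_reward, the values +1.0 and -1.0, and the neutral 0.0 for an ordinary step are hypothetical choices, not values given in the article.

```python
def walking_reward(position, couch, hazards):
    """Return +1.0 on reaching the couch, -1.0 on getting hurt, and 0.0 otherwise."""
    if position == couch:
        return 1.0   # positive reward: the walker reached the couch
    if position in hazards:
        return -1.0  # negative reward: the walker bumped into something and got hurt
    return 0.0       # assumed neutral reward for an ordinary step

couch = (2, 3)
hazards = {(1, 1)}  # for example, the table
print(walking_reward((2, 3), couch, hazards))
print(walking_reward((1, 1), couch, hazards))
print(walking_reward((0, 1), couch, hazards))
```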

Step 4 – Create an agent

Choose a way to represent the policy (e.g., using neural networks). Different representations are often tied to a specific type of training algorithm, but
in general most modern algorithms rely on neural networks because they are good candidates for large state or action spaces in complex problems.
Select an appropriate training algorithm. RL offers value-based and model-based methods. A value-based method learns how much reward to expect from each state or action and chooses the actions that maximize it, whereas a model-based method has the agent learn a model of the environment from experience and use it to plan for maximum reward.
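As an illustration of a value-based method, the sketch below shows one update step of tabular Q-learning. The article does not name a specific algorithm, so Q-learning is used here only as a common example; the function name q_update, the learning rate alpha, and the discount factor gamma are assumptions. A table is used instead of a neural network to keep the example small.

```python
from collections import defaultdict

def q_update(q_table, state, action, reward, next_state, actions, alpha=0.5, gamma=0.9):
    """Move Q(state, action) part of the way toward reward + gamma * best next value."""
    best_next = max(q_table[(next_state, a)] for a in actions)
    target = reward + gamma * best_next
    q_table[(state, action)] += alpha * (target - q_table[(state, action)])
    return q_table[(state, action)]

q_table = defaultdict(float)  # unseen (state, action) pairs start at 0.0
actions = ["left", "right"]

# One experience: in state 0 the agent moved right, got reward 1.0, and landed in state 1
q_update(q_table, state=0, action="right", reward=1.0, next_state=1, actions=actions)
print(q_table[(0, "right")])  # 0.5
```

Repeating this update over many experiences makes the table entries approach the long-term reward of each action, and the policy then simply picks the action with the highest entry.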

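Step 5 & 6 – Train and validate the agent

Set up the training options (e.g., stopping criteria) and train the agent to tune the policy (for example, by specifying the number of epochs to run). Then validate the trained agent by simulating it in the environment.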
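A training loop with a stopping criterion might look like the sketch below. Everything in it is an assumption for illustration: the five-cell corridor with the couch at the far end, the tabular Q-learning update, and the hyperparameters. Training stops either when the learned policy walks straight to the couch or when the episode limit is reached.

```python
import random
from collections import defaultdict

GOAL = 4           # hypothetical corridor of cells 0..4 with the couch at cell 4
ACTIONS = (-1, 1)  # step back, step forward

def step(state, action):
    """Move along the corridor; the reward is 1.0 only when the couch is reached."""
    next_state = min(GOAL, max(0, state + action))
    done = next_state == GOAL
    return next_state, (1.0 if done else 0.0), done

def choose_action(q, state, epsilon, rng):
    """Epsilon-greedy choice; ties between equally good actions are broken at random."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(q[(state, a)] for a in ACTIONS)
    return rng.choice([a for a in ACTIONS if q[(state, a)] == best])

def policy_is_forward(q):
    """Stopping criterion: stepping forward looks strictly better in every cell."""
    return all(q[(s, 1)] > q[(s, -1)] for s in range(GOAL))

def train(max_episodes=500, max_steps=50, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = defaultdict(float)
    for episode in range(1, max_episodes + 1):
        state = 0
        for _ in range(max_steps):
            action = choose_action(q, state, epsilon, rng)
            next_state, reward, done = step(state, action)
            best_next = 0.0 if done else max(q[(next_state, a)] for a in ACTIONS)
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
            if done:
                break
        if policy_is_forward(q):
            return q, episode    # stop early: the policy is good enough
    return q, max_episodes       # stop because the episode limit was reached

q, episodes_used = train()
print(episodes_used)
```

Validation would then run the learned policy without exploration and check that it reaches the couch.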

Step 7 – Deploy policy

Deploy the trained policy representation.
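For a table-based agent, deploying can be as simple as freezing the learned values into a lookup from state to action. The sketch below assumes hypothetical learned values for a three-cell corridor; the function name extract_policy and the numbers in q_table are made up for illustration.

```python
def extract_policy(q_table, states, actions):
    """Freeze a learned Q-table into a state -> best action lookup."""
    return {s: max(actions, key=lambda a: q_table.get((s, a), 0.0)) for s in states}

# Hypothetical learned values; action +1 moves toward the couch, -1 moves away
q_table = {
    (0, 1): 0.81, (0, -1): 0.73,
    (1, 1): 0.90, (1, -1): 0.66,
    (2, 1): 1.00, (2, -1): 0.81,
}
policy = extract_policy(q_table, states=[0, 1, 2], actions=[-1, 1])
print(policy)  # {0: 1, 1: 1, 2: 1}
```

The deployed policy no longer explores or learns; it only looks up the action for the current state.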

Example of Reinforcement Learning – Learning to drive a car

Exploration Episode 1

The control policy is initialized with random parameters.
During training, the car explores with random actions.
When the algorithm makes a mistake, the safety driver intervenes.

Exploration Episode 2

The algorithm gets rewarded for distance traveled before intervention.
When the driver resets the car, the policy is optimized. All optimization is done onboard the car.
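The reward described here, distance traveled before the safety driver intervenes, can be sketched as below. The function name episode_reward and the per-step lists are hypothetical; the original experiment's code is not shown in this article.

```python
def episode_reward(step_distances_m, interventions):
    """Total distance in meters covered before the first safety-driver intervention."""
    total = 0.0
    for distance, intervened in zip(step_distances_m, interventions):
        if intervened:
            break  # the episode ends when the driver takes over
        total += distance
    return total

# Made-up episode: the driver intervenes on the fourth step
print(episode_reward([1.0, 2.0, 3.0, 4.0], [False, False, False, True]))  # 6.0
```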

Exploration Episode 3

In evaluation episode 1, the model has yet to learn to drive. Here the car covers 9.8 meters, which is its reward.
In evaluation episode 2, the model has learned to correct a little but is still unstable. The model is a deep convolutional neural network.
The model input is a single monocular image. The model outputs the steering angle and speed. Here the car covers 53.8 meters, which is its reward.
After 11 training episodes, the algorithm has learned to follow the lane. The experiment is repeated under different weather conditions. This shows that RL can learn to drive without hand-coded rules or maps.

Painting the Future with RL

It is predicted that reinforcement learning is going to dominate the future of the technology world. Deep reinforcement learning combines
neural networks with RL to produce more effective solutions.

Benefits

Improves the accuracy of predictions, which enables better data-driven decisions.
Learns from unstructured and unlabeled datasets, which enables the analysis of unstructured data.
