Workshop on Reinforcement Learning and Stochastic Control in Queues and Networks

Session ReStoq-1

Session 1

Oct 18 Mon, 10:00 AM — 12:30 PM EDT

Approximate planning and learning for partially observed systems

Aditya Mahajan

Reinforcement learning (RL) provides a conceptual framework for designing agents that learn to act optimally in unknown environments. RL has been used successfully in applications ranging from robotics and industrial automation to finance, healthcare, and natural language processing. This success rests on a solid foundation that combines the theory of exact and approximate Markov decision processes (MDPs) with iterative algorithms guaranteed to learn an exact or approximate action-value function and/or an approximately optimal policy. However, for the most part, research on RL theory has focused on systems with full state observations. In many applications, including robotics, finance, and healthcare, the agent receives only a partial observation of the state of the environment. In this talk, I will describe a new framework for approximate planning and learning for partially observed systems based on the notion of approximate information state. The talk will highlight the strong theoretical foundations of this framework, illustrate how many existing approximation results can be viewed as special cases of approximate information state, and provide empirical evidence suggesting that this approach works well in practice. Joint work with Jayakumar Subramanian, Amit Sinha, and Raihan Seraj.
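Roughly speaking (this is a paraphrase of the published definition of the framework, with epsilon, delta, the reward predictor, and the transition kernel as generic placeholders rather than notation from the talk), an (epsilon, delta)-approximate information state is a compression z_t = sigma(h_t) of the history for which there exist a reward predictor and an approximate transition kernel such that, for every history h_t and action a_t,

\[
\bigl|\mathbb{E}[R_t \mid h_t, a_t] - \hat{r}(\sigma(h_t), a_t)\bigr| \le \varepsilon,
\qquad
d\bigl(\mathbb{P}(z_{t+1} \in \cdot \mid h_t, a_t),\ \hat{\nu}(\cdot \mid \sigma(h_t), a_t)\bigr) \le \delta,
\]

where d is a suitable metric on probability measures. Planning on the compressed state then incurs an approximation error bounded in terms of epsilon and delta.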

TBD

Dan Russo

TBD

Online Reinforcement Learning for MDPs and POMDPs via Posterior Sampling

Rahul Jain

Online learning in dynamic contexts has long been a goal of Reinforcement Learning (RL), and yet most RL algorithms are really offline algorithms. There are two main methodologies for designing online RL algorithms: optimism-based and posterior-sampling (PS) based. In contrast to optimism-based algorithms, posterior-sampling-based algorithms tend to be much simpler (and hence more elegant), have superior numerical performance, and often admit simpler regret analysis as well. In this talk, we will explore PS-based online RL algorithms for infinite-horizon average-reward MDPs, random-horizon (SSP) MDPs, and partially observed MDPs. In each case, we will demonstrate superior numerical performance (where relevant) and establish upper bounds on expected regret. In all cases, the algorithms are simple, natural (and in some sense interpretable), and require only one aspect (the "epoch" length) to be designed. In some sense, this places PSRL algorithms on an equal footing with optimism-based RL algorithms. This is based on joint work with Mehdi Jafarnia and other collaborators.
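As a rough illustration of the posterior-sampling template referred to above, here is a generic tabular, episodic PSRL sketch (the env.reset()/env.step() interface, the Dirichlet transition prior, and the episode schedule are illustrative assumptions; this is not one of the specific algorithms analyzed in the talk):

import numpy as np

def psrl(env, n_states, n_actions, horizon, n_episodes, seed=0):
    """Generic posterior-sampling RL sketch for a tabular, episodic MDP.

    Keeps a Dirichlet posterior over transitions and a running mean of
    rewards; at the start of each episode ("epoch"), samples one MDP from
    the posterior, solves it by backward induction, and follows the
    resulting policy for the whole episode.
    """
    rng = np.random.default_rng(seed)
    dirichlet = np.ones((n_states, n_actions, n_states))  # transition pseudo-counts
    r_sum = np.zeros((n_states, n_actions))
    r_cnt = np.ones((n_states, n_actions))

    for _ in range(n_episodes):
        # Sample a model from the posterior.
        P = np.array([[rng.dirichlet(dirichlet[s, a]) for a in range(n_actions)]
                      for s in range(n_states)])
        R = r_sum / r_cnt

        # Solve the sampled MDP by backward induction over the horizon.
        V = np.zeros(n_states)
        policy = np.zeros((horizon, n_states), dtype=int)
        for t in reversed(range(horizon)):
            Q = R + P @ V            # Q[s, a] = R[s, a] + sum_s' P[s, a, s'] V[s']
            policy[t] = Q.argmax(axis=1)
            V = Q.max(axis=1)

        # Act with the sampled policy and update the posterior from data.
        s = env.reset()
        for t in range(horizon):
            a = policy[t, s]
            s_next, r = env.step(a)  # assumed interface: returns (next state, reward)
            dirichlet[s, a, s_next] += 1.0
            r_sum[s, a] += r
            r_cnt[s, a] += 1.0
            s = s_next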

TBD

Mengdi Wang

TBD

TBD

Vijay Subramanian

TBD

Session Chair

Vijay Subramanian

Session ReStoq-2

Session 2

Oct 18 Mon, 2:00 PM — 4:30 PM EDT

Learning algorithm for optimal network control

Eytan Modiano

We discuss the role of learning in optimal network control, with a focus on network stability in networks with uncooperative nodes, whose dynamics cannot be controlled and whose state can only be sporadically estimated. It is well known that the MaxWeight algorithm can stabilize the network when all of the nodes are fully cooperative. We show that MaxWeight fails when some of the nodes are uncooperative, and we introduce Tracking-MaxWeight (TMW), which enhances the original algorithm with explicit learning of the policy used by uncooperative nodes. We show that TMW achieves stability as long as the state of the uncooperative nodes can be sporadically estimated, and we characterize the impact of infrequent and erroneous estimates on stability. Next, we consider the scenario where the uncontrollable nodes use a queue-dependent policy and we seek to minimize queue backlogs (network delays). We formulate the problem as an MDP with unknown queueing dynamics and propose a new reinforcement learning algorithm, called Truncated Upper Confidence Reinforcement Learning (TUCRL), that can achieve optimal performance. Finally, we propose a new framework for optimal reinforcement learning for queueing networks with unbounded state spaces.
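For context, a minimal sketch of the classical MaxWeight rule that the abstract builds on (a generic single-hop version with an assumed queue/rate/schedule interface; Tracking-MaxWeight itself is not reproduced here):

import numpy as np

def maxweight_schedule(queue_lengths, service_rates, feasible_schedules):
    """Classical MaxWeight: pick the feasible activation vector x maximizing
    sum_i q_i * mu_i * x_i.

    queue_lengths      : array of current backlogs q_i
    service_rates      : array of service rates mu_i
    feasible_schedules : list of 0/1 activation vectors satisfying the
                         interference constraints
    """
    weights = np.asarray(queue_lengths) * np.asarray(service_rates)
    return max(feasible_schedules, key=lambda x: float(weights @ np.asarray(x)))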

Data-Driven Stochastic Network Control via Reinforcement Learning

Qiaomin Xie

Finding a good control policy of stochastic queueing networks is crucial for achieving desirable performance (e.g., high throughput or low delay). These stochastic control problems have been recognized to be notoriously difficult. In this talk, we consider a new paradigm of data-driven approaches that learn network control policies using Reinforcement Learning (RL). Specifically, we study the problem of RL with unbounded state spaces --- a ubiquitous feature of queueing networks. Existing approaches and performance metrics, mostly designed for finite or compact state spaces, are not well suited for this setting. Inspired by the literature in queueing systems and control theory, we propose a notion of stability as a performance metric for RL policies: the state dynamics under the policy should remain in a bounded region with high probability. Building on our prior work on Monte Carlo tree search, we develop an RL policy that provably achieves the stability property whenever the optimal policy respects a Lyapunov function. Moreover, we design an adaptive version of the algorithm, based on carefully constructed statistical tests, which does not require knowledge of the Lyapunov function nor its drift parameter.
Given a stable policy, we further develop a model-based RL method and prove that it converges to the optimal policy. Our method demonstrates promising performance in a variety of network control problems including routing, dynamic server allocation and switch scheduling.
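For readers less familiar with the stability notion used here, the kind of condition involved is a standard Foster-Lyapunov drift inequality (stated generically; the exact assumption placed on the optimal policy in the talk may differ): there exist a function V >= 0, a bounded region B, and epsilon > 0 such that, under the policy,

\[
\mathbb{E}\bigl[V(s_{t+1}) - V(s_t) \mid s_t = s\bigr] \le -\epsilon \quad \text{for all } s \notin B,
\]

so the state is repeatedly driven back toward the bounded region rather than drifting off to infinity.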

A Provably-Efficient Model-Free Algorithm for Constrained Markov Decision Processes

Lei Ying

Traditional reinforcement learning aims to maximize the expected cumulative reward, but in practice many applications must operate under a variety of operational constraints. This talk introduces a model-free approach to constrained reinforcement learning, which enforces operational constraints, such as safety and fairness, during both learning and decision making.
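For readers unfamiliar with the setting, the constrained MDP formulation underlying such approaches typically reads as follows (a generic statement, not the specific formulation from the talk):

\[
\max_{\pi} \ \mathbb{E}_{\pi}\Bigl[\sum_{t} r(s_t, a_t)\Bigr]
\quad \text{subject to} \quad
\mathbb{E}_{\pi}\Bigl[\sum_{t} c_i(s_t, a_t)\Bigr] \le C_i, \qquad i = 1, \dots, m,
\]

where the cost functions c_i encode operational requirements such as safety or fairness budgets.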

Learning based meta-scheduling in wireless networks

Sanjay Shakkottai

In this talk, we discuss learning-assisted multi-user scheduling and slicing for the wireless downlink. Many scheduling algorithms have been developed that optimize for a plethora of performance metrics; however, a systematic approach across diverse performance metrics and deployment scenarios is still lacking. We address this by developing a multi-armed bandit based meta-scheduler: given a diverse collection of schedulers, we develop a learning-based overlay algorithm (meta-scheduler) that selects the "best" scheduler from amongst these for each deployment scenario. Furthermore, we build on this approach to develop bandit algorithms for network slicing. Based on joint work with Jianhan Song and Gustavo de Veciana.
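As a hedged illustration of the overlay idea, here is a plain UCB1 selector over a set of base schedulers (the class name, the reward interface, and the exploration bonus are illustrative assumptions, not the specific meta-scheduler from the talk):

import math

class UCBMetaScheduler:
    """Bandit meta-scheduler sketch: each 'arm' is a base scheduler, and the
    observed performance metric (e.g., negative delay) is treated as the reward."""

    def __init__(self, schedulers):
        self.schedulers = schedulers
        self.counts = [0] * len(schedulers)
        self.means = [0.0] * len(schedulers)
        self.t = 0

    def select(self):
        self.t += 1
        for i, n in enumerate(self.counts):
            if n == 0:                     # try each scheduler once first
                return i
        ucb = [m + math.sqrt(2.0 * math.log(self.t) / n)
               for m, n in zip(self.means, self.counts)]
        return max(range(len(ucb)), key=lambda i: ucb[i])

    def update(self, i, reward):
        """Record the performance observed while running scheduler i."""
        self.counts[i] += 1
        self.means[i] += (reward - self.means[i]) / self.counts[i]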

Job Dispatching Policies for Queueing Systems with Unknown Service Rates

Weina Wang

In multi-server queueing systems where there is no central queue holding all incoming jobs, job dispatching policies are used to assign incoming jobs to the queue at one of the servers. Classic job dispatching policies such as join-the-shortest-queue and shortest expected delay assume that the service rates and queue lengths of the servers are known to the dispatcher. In this work, we tackle the problem of job dispatching without the knowledge of service rates and queue lengths, where the dispatcher can only obtain noisy estimates of the service rates by observing job departures. This problem presents a novel exploration-exploitation trade-off between sending jobs to all the servers to estimate their service rates, and exploiting the currently known fastest servers to minimize the expected queueing delay. We propose a bandit-based exploration policy that learns the service rates from observed job departures. Unlike the standard multi-armed bandit problem where only one out of a finite set of actions is optimal, here the optimal policy requires identifying the optimal fraction of incoming jobs to be sent to each server. We present a regret analysis and simulations to demonstrate the effectiveness of the proposed bandit-based exploration policy. This is joint work with Tuhinangshu Choudhury and Gauri Joshi.
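The exploration-exploitation structure can be illustrated with a deliberately simplified dispatcher (an illustrative sketch, not the paper's algorithm: it routes a small exploration fraction of jobs uniformly at random and sends the rest to the server with the highest estimated rate, whereas the actual policy learns optimal routing fractions):

import random

class ExploreExploitDispatcher:
    """Dispatch jobs while estimating unknown service rates from observed departures."""

    def __init__(self, n_servers, explore_prob=0.05):
        self.n = n_servers
        self.explore_prob = explore_prob
        self.busy_time = [1e-9] * n_servers   # total observed busy time per server
        self.departures = [0] * n_servers     # completed jobs observed per server

    def dispatch(self):
        if random.random() < self.explore_prob:
            return random.randrange(self.n)    # explore: keep refining rate estimates
        rates = [d / t for d, t in zip(self.departures, self.busy_time)]
        return max(range(self.n), key=lambda i: rates[i])  # exploit: estimated fastest

    def record_departure(self, server, service_time):
        """Update the rate estimate when a job departs from the given server."""
        self.departures[server] += 1
        self.busy_time[server] += service_time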

Session Chair

Gauri Joshi and Weina Wang

