Workshop on Reinforcement Learning and Stochastic Control in Queues and Networks

Session ReStoq-1

Session 1

Conference
10:00 AM — 12:30 PM EDT
Local
Oct 18 Mon, 10:00 AM — 12:30 PM EDT

Approximate planning and learning for partially observed systems

Aditya Mahajan (McGill University, Canada)

0
Reinforcement learning (RL) provides a conceptual framework for designing agents which learn to act optimally in unknown environments. RL has been successfully used in various applications ranging from robotics, industrial automation, finance, healthcare, and natural language processing. The success of RL is based on a solid foundation of combining the theory of exact and approximate Markov decision processes (MDPs) with iterative algorithms that are guaranteed to learn an exact or approximate action-value function and/or an approximately optimal policy. However, for the most part, the research on RL theory is focused on systems with full state observations. In various applications including robotics, finance, and healthcare, the agent only gets a partial observation of the state of the environment. In this talk, I will describe a new framework for approximate planning and learning for partially observed systems based on the notion of approximate information state. The talk will highlight the strong theoretical foundations of this framework, illustrate how many of the existing approximation results can be viewed as a special case of approximate information state, and provide empirical evidence which suggests
that this approach works well in practice. Joint work with Jayakumar Subramanian, Amit Sinha, and Raihan Seraj.

Approximation Benefits of Policy Gradient Methods with Aggregated States

Dan Russo (Columbia University, USA)

0
Folklore suggests that policy gradient can be more robust to misspecification than its relative, approximate policy iteration. This paper studies the case of state-aggregation, where the state space is partitioned and either the policy or value function approximation is held constant over partitions. This paper shows a policy gradient method converges to a policy whose regret per-period is bounded by ϵ, the largest difference between two elements of the state-action value function belonging to a common partition. With the same representation, both approximate policy iteration and approximate value iteration can produce policies whose per-period regret scales as ϵ/(1−γ), where γ is a discount factor. Theoretical results synthesize recent analysis of policy gradient methods with insights of Van Roy (2006) into the critical role of state-relevance weights in approximate dynamic programming.

Towards an approximate information state for multi-agent RL problems

Vijay Subramanian (University of Michigan, USA)

0
Information asymmetry is a big challenge in multi-agent systems. Within the centralized training with distributed execution framework, we develop conditions that an approximate information state should satisfy so that low regret policies can be obtained. A key new feature that we highlight is the compression of actions of the agents (and accompanying contribution to regret) that occurs as a result of an approximate information state. This is joint work with Hsu Kao.

Compressive state representation learning towards small-data RL applications

Mengdi Wang (Princeton University, USA)

0
In this talk we survey recent advances on statistical efficiency and regret of reinforcement learning (RL) when good state representations are available. Motivated by the RL theory, we discuss what should be good state representations for RL and how to find compact state embeddings from high-dimensional Markov state trajectories. In the spirit of diffusion map for dynamical systems, we propose an efficient method for learning a low-dimensional state embedding and capturing the process's dynamics. State embedding can be used to cluster states into metastable sets predict future dynamics, and enable generalizable downstream machine learning and reinforcement learning tasks. We demonstrated applications of the approach in games, clinical pathway optimization, single-cell biology and identification of gene markers for drug discovery.

Online Reinforcement Learning for MDPs and POMDPs via Posterior Sampling

Rahul Jain (University of Southern California, USA)

0
Online learning in dynamic contexts has long been a goal of Reinforcement Learning (RL). And yet, most RL algorithms are really offline algorithms. There are two design methodologies for designing Online RL algorithms: optimism-based and posterior-sampling (PS). In contrast to optimism-based algorithms, posterior sampling-based algorithms tend to be much simpler (and hence, more elegant), have superior numerical performance and often simpler regret analysis as well. In this talk, we will explore PS-based online RL algorithms for infinite-horizon average-reward MDPs, random horizon (SSP) MDPs, and partially observed MDPs. In each case, we will demonstrate superior numerical performance (where relevant) and establish upper bounds on expected regret. In all cases, the algorithms are simple, natural (and in some sense interpretable), and require only one aspect (the ÒepochÓ length) to be designed. In some sense, this places PSRL algorithms on equal footing with Optimism-based RL algorithms. This is based on joint work with Mehdi Jafarnia, and other collaborators.

Session Chair

Vijay Subramanian (University of Michigan, USA)

Enter Zoom
Session ReStoq-break

Virtual Lunch Break

Conference
12:30 PM — 2:00 PM EDT
Local
Oct 18 Mon, 12:30 PM — 2:00 PM EDT

Enter Zoom
Session ReStoq-2

Session 2

Conference
2:00 PM — 4:30 PM EDT
Local
Oct 18 Mon, 2:00 PM — 4:30 PM EDT

Learning algorithm for optimal network control

Eytan Modiano (MIT, USA)

0
We discuss the role of learning in optimal network control, with focus on network stability in networks with uncooperative nodes, whose dynamics cannot be controlled and only sporadically estimated. It is well known that the MaxWeight algorithm can stabilize the network when all of the nodes are fully cooperative. We show that MaxWeight fails when some of the nodes are uncooperative, and introduce Tracking-MaxWeight (TMW), which enhances the original algorithm with an explicit learning of the policy used by uncooperative nodes. We show that TMW achieves stability as long as the state of uncooperative nodes can be sporadically estimated, and characterize the impact of infrequent and erroneous estimates on stability. Next, we consider the scenario where uncontrollable nodes use a queue-dependent policy and seek to minimize queue backlogs (network delays). We formulate the problem as an MDP with unknown queueing dynamics, and propose a new reinforcement learning algorithm, called Truncated Upper Confidence Reinforcement Learning (TUCRL), that can achieve optimal performance. Finally, we propose a new framework for optimal reinforcement learning for queueing networks with unbounded state-spaces.

Data-Driven Stochastic Network Control via Reinforcement Learning

Qiaomin Xie (University of Wisconsin–Madison, USA)

0
Finding a good control policy of stochastic queueing networks is crucial for achieving desirable performance (e.g., high throughput or low delay). These stochastic control problems have been recognized to be notoriously difficult. In this talk, we consider a new paradigm of data-driven approaches that learn network control policies using Reinforcement Learning (RL). Specifically, we study the problem of RL with unbounded state spaces --- a ubiquitous feature of queueing networks. Existing approaches and performance metrics, mostly designed for finite or compact state spaces, are not well suited for this setting. Inspired by the literature in queuing systems and control theory, we propose a notion of stability as a performance metric for RL policies: the state dynamics under the policy should remain in a bounded region with high probability. Building on our prior work on Monte Carlo tree search, we develop an RL policy that provably achieves the stability property whenever the optimal policy respects a Lyapunov function. Moreover, we design an adaptive version of the algorithm, based on carefully constructed statistical tests, which does not require knowledge of the Lyapunov function nor its drift parameter.
Given a stable policy, we further develop a model-based RL method and prove that it converges to the optimal policy. Our method demonstrates promising performance in a variety of network control problems including routing, dynamic server allocation and switch scheduling.

Job Dispatching Policies for Queueing Systems with Unknown Service Rates

Weina Wang (Carnegie Mellon University, USA)

0
In multi-server queueing systems where there is no central queue holding all incoming jobs, job dispatching policies are used to assign incoming jobs to the queue at one of the servers. Classic job dispatching policies such as join-the-shortest-queue and shortest expected delay assume that the service rates and queue lengths of the servers are known to the dispatcher. In this work, we tackle the problem of job dispatching without the knowledge of service rates and queue lengths, where the dispatcher can only obtain noisy estimates of the service rates by observing job departures. This problem presents a novel exploration-exploitation trade-off between sending jobs to all the servers to estimate their service rates, and exploiting the currently known fastest servers to minimize the expected queueing delay. We propose a bandit-based exploration policy that learns the service rates from observed job departures. Unlike the standard multi-armed bandit problem where only one out of a finite set of actions is optimal, here the optimal policy requires identifying the optimal fraction of incoming jobs to be sent to each server. We present a regret analysis and simulations to demonstrate the effectiveness of the proposed bandit-based exploration policy. This is joint work with Tuhinangshu Choudhury and Gauri Joshi.

Learning based meta-scheduling in wireless networks

Sanjay Shakkottai (The University of Texas at Austin, USA)

0
In this talk, we discuss learning-assisted multi-user scheduling and slicing for the wireless downlink. There have been many scheduling algorithms developed that optimize for a plethora of performance metrics; however a systematic approach across diverse performance metrics and deployment scenarios is still lacking. We address this by developing a multi-armed bandit based meta-scheduler Ð given a diverse collection of schedulers, we develop a learning-based overlay algorithm (meta-scheduler) that selects that Òbest" scheduler from amongst these for each deployment scenario. Furthermore, we build on this approach to develop bandit algorithms for network slicing. Based on joint work with Jianhan Song and Gustavo de Veciana.

A Provably-Efficient Model-Free Algorithm for Constrained Markov Decision Processes

Lei Ying (University of Michigan, USA)

0
Traditional reinforcement learning aims at maximizing the expected cumulative reward, but in practice, many applications need to be operated under a variety of operational constraints. This talk introduces a model-free approach for constrained reinforcement learning, which ensures operational constraints, such as safety and fairness, during both learning and decision making.

Session Chair

Gauri Joshi and Weina Wang (Carnegie Mellon University, USA)

Enter Zoom

Made with in Toronto · Privacy Policy · WiOpt 2020 · © 2021 Duetone Corp.