Delay and cooperation in nonstochastic linear bandits

Shinji Ito, Daisuke Hatano, Hanna Sumita, Kei Takemura, Takuro Fukunaga, Naonori Kakimura, Ken Ichi Kawarabayashi

研究成果: Conference article査読

1 被引用数 (Scopus)

抄録

This paper offers a nearly optimal algorithm for online linear optimization with delayed bandit feedback. Online linear optimization with bandit feedback, or nonstochastic linear bandits, provides a generic framework for sequential decision-making problems with limited information. This framework, however, assumes that feedback can be observed just after choosing the action, and, hence, does not apply directly to many practical applications, in which the feedback can often only be obtained after a while. To cope with such situations, we consider problem settings in which the feedback can be observed d rounds after the choice of an action, and propose an algorithm for which the expected regret is Õ(pm(m + d)T), ignoring logarithmic factors in m and T, where m and T denote the dimensionality of the action set and the number of rounds, respectively. This algorithm achieves nearly optimal performance, as we are able to show that arbitrary algorithms suffer the regret of O(pm(m + d)T) in the worst case. To develop the algorithm, we introduce a technique we refer to as distribution truncation, which plays an essential role in bounding the regret. We also apply our approach to cooperative bandits, as studied by Cesa-Bianchi et al. [18] and Bar-On and Mansour [12], and extend their results to the linear bandits setting.

本文言語English
ジャーナルAdvances in Neural Information Processing Systems
2020-December
出版ステータスPublished - 2020
イベント34th Conference on Neural Information Processing Systems, NeurIPS 2020 - Virtual, Online
継続期間: 2020 12 62020 12 12

ASJC Scopus subject areas

  • コンピュータ ネットワークおよび通信
  • 情報システム
  • 信号処理

フィンガープリント

「Delay and cooperation in nonstochastic linear bandits」の研究トピックを掘り下げます。これらがまとまってユニークなフィンガープリントを構成します。

引用スタイル