In:
Operations Research, Institute for Operations Research and the Management Sciences (INFORMS)
Abstract:
Q-learning, which seeks to learn the optimal Q-function of a Markov decision process (MDP) in a model-free fashion, lies at the heart of reinforcement learning. When it comes to the synchronous setting (such that independent samples for all state–action pairs are drawn from a generative model in each iteration), substantial progress has been made toward understanding the sample efficiency of Q-learning. Consider a γ-discounted infinite-horizon MDP with state space S and action space A: to yield an entry-wise ε-approximation of the optimal Q-function, state-of-the-art theory for Q-learning requires a sample size exceeding the order of |S||A| / ((1 − γ)^5 ε^2), which fails to match existing minimax lower bounds. This gives rise to natural questions: What is the sharp sample complexity of Q-learning? Is Q-learning provably suboptimal? This paper addresses these questions for the synchronous setting: (1) when the action space contains a single action (so that Q-learning reduces to TD learning), we prove that the sample complexity of TD learning is minimax optimal and scales as |S| / ((1 − γ)^3 ε^2) (up to log factor); (2) when the action space contains at least two actions, we settle the sample complexity of Q-learning to be on the order of |S||A| / ((1 − γ)^4 ε^2) (up to log factor). Our theory unveils the strict suboptimality of Q-learning when the action space contains at least two actions and rigorizes the negative impact of overestimation in Q-learning. Finally, we extend our analysis to accommodate asynchronous Q-learning (i.e., the case with Markovian samples), sharpening the horizon dependency of its sample complexity to 1 / (1 − γ)^4.
Funding: Y. Chen is supported in part by the Alfred P. Sloan Research Fellowship, the Google Research Scholar Award, the Air Force Office of Scientific Research [Grant FA9550-22-1-0198], the Office of Naval Research (ONR) [Grant N00014-22-1-2354], and the National Science Foundation (NSF) [Grants CCF-2221009, CCF-1907661, DMS-2014279, IIS-2218713, and IIS-2218773]. Y. Wei is supported in part by the Google Research Scholar Award and the NSF [Grants CCF-2106778, DMS-2147546/2015447, and CAREER award DMS-2143215]. Y. Chi is supported in part by the ONR [Grants N00014-18-1-2142 and N00014-19-1-2404] and the NSF [Grants CCF-1806154, CCF-2007911, CCF-2106778, ECCS-2126634, and DMS-2134080].
Supplemental Material: The online appendix is available at https://doi.org/10.1287/opre.2023.2450.
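For readers unfamiliar with the synchronous setting described in the abstract, the sketch below illustrates tabular Q-learning with generative-model sampling. It is a minimal illustration only: the function name, the transition kernel P, the reward table r, and the step-size schedule are assumptions for this example, not details taken from the paper.

    import numpy as np

    def synchronous_q_learning(P, r, gamma, num_iters, seed=None):
        # P: transition kernel, shape (S, A, S); r: reward table, shape (S, A).
        rng = np.random.default_rng(seed)
        S, A = r.shape
        Q = np.zeros((S, A))
        for t in range(1, num_iters + 1):
            eta = 1.0 / (1.0 + (1.0 - gamma) * t)  # a rescaled-linear step size (one common choice)
            V = Q.max(axis=1)                       # current value estimate V(s) = max_a Q(s, a)
            for s in range(S):
                for a in range(A):
                    # Generative model: draw an independent next state for every (s, a) pair.
                    s_next = rng.choice(S, p=P[s, a])
                    target = r[s, a] + gamma * V[s_next]  # empirical Bellman update
                    Q[s, a] = (1.0 - eta) * Q[s, a] + eta * target
        return Q

A typical call would be synchronous_q_learning(P, r, gamma=0.9, num_iters=10_000); the entry-wise error max over (s, a) of |Q(s, a) − Q*(s, a)| can then be compared against the total sample budget num_iters × S × A.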
Type of Medium:
Online Resource
ISSN:
0030-364X, 1526-5463
DOI:
10.1287/opre.2023.2450
Language:
English
Publisher:
Institute for Operations Research and the Management Sciences (INFORMS)
Publication Date:
2023
ZDB-ID:
2019440-7, 123389-0
SSG:
3,2