
Beijing Jiaotong University PhD Student: Reinforcement Learning and Policy Evaluation

Talk Overview

Reinforcement learning (RL) has received increasing attention in recent years, and its theoretical foundations remain an active research topic. In this talk, we will first go through the theoretical framework of reinforcement learning. Within that framework, policy evaluation is one of the most fundamental and most important components of RL, and analyzing the convergence behavior of policy-evaluation algorithms is essential for understanding and improving this class of methods. However, results derived only under highly idealized assumptions are often unconvincing. In the work presented here, we give convergence-rate results for a class of policy-evaluation algorithms under assumptions that are closer to practice (the inherently non-i.i.d. data of RL, multiple step-size schedules, etc.). This answers more precisely how these algorithms converge and provides a usable theoretical tool for similar problems.
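
For reference, a minimal statement of the policy-evaluation problem with linear function approximation, in standard notation (this framing is standard background and an assumption about the talk's setting, not quoted from it):

```latex
% Policy evaluation: estimate the value function of a fixed policy \pi
V^{\pi}(s) \;=\; \mathbb{E}_{\pi}\!\left[\, \sum_{t=0}^{\infty} \gamma^{t} r_{t+1} \;\middle|\; s_{0}=s \right],
\qquad 0 \le \gamma < 1 .

% Linear function approximation: features \phi(s) \in \mathbb{R}^{d}, parameters \theta
V_{\theta}(s) \;=\; \theta^{\top}\phi(s) \;\approx\; V^{\pi}(s) .
```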

Recommended Pre-reading

"Finite-Sample Analysis of Proximal Gradient TD Algorithms"

Paper link: http://chercheurs.lille.inria.fr/~ghavamza/my_website/Publications_files/uai15.pdf

In this paper, we show for the first time how gradient TD (GTD) reinforcement learning methods can be formally derived as true stochastic gradient algorithms, not with respect to their original objective functions as previously attempted, but rather using derived primal-dual saddle-point objective functions. We then conduct a saddle-point error analysis to obtain finite-sample bounds on their performance. Previous analyses of this class of algorithms use stochastic approximation techniques to prove asymptotic convergence, and no finite-sample analysis had been attempted. Two novel GTD algorithms are also proposed, namely projected GTD2 and GTD2-MP, which use proximal "mirror maps" to yield improved convergence guarantees and acceleration, respectively. The results of our theoretical analysis imply that the GTD family of algorithms are comparable and may indeed be preferred over existing least-squares TD methods for off-policy learning, due to their linear complexity. We provide experimental results showing the improved performance of our accelerated gradient TD methods.
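
As a rough sketch of the formulation referred to in this abstract, written in the usual MSPBE notation (the definitions of A, b, and M below are the standard TD quantities; the exact off-policy weighting should be checked against the paper):

```latex
% With A = E[\phi_t(\phi_t - \gamma\phi_{t+1})^{\top}],\; b = E[r_t \phi_t],\; M = E[\phi_t \phi_t^{\top}],
% the mean-squared projected Bellman error can be written as
\mathrm{MSPBE}(\theta) \;=\; \tfrac{1}{2}\,\lVert A\theta - b \rVert_{M^{-1}}^{2} ,

% which is equivalent to the convex-concave saddle-point problem
\min_{\theta}\;\max_{w}\; \big\langle\, b - A\theta,\; w \,\big\rangle \;-\; \tfrac{1}{2}\,\lVert w \rVert_{M}^{2} ,

% whose stochastic primal-dual gradient steps recover GTD-style updates.
```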

"A Convergent O(n) Temporal-Difference Algorithm for Off-Policy Learning with Linear Function Approximation"

Paper link: http://papers.nips.cc/paper/3626-a-convergent-on-temporal-difference-algorithm-for-off-policy-learning-with-linear-function-approximation

We introduce the first temporal-difference learning algorithm that is stable with linear function approximation and off-policy training, for any finite Markov decision process, target policy, and exciting behavior policy, and whose complexity scales linearly in the number of parameters. We consider an i.i.d. policy-evaluation setting in which the data need not come from on-policy experience. The gradient temporal-difference (GTD) algorithm estimates the expected update vector of the TD(0) algorithm and performs stochastic gradient descent on its L_2 norm. Our analysis proves that its expected update is in the direction of the gradient, assuring convergence under the usual stochastic approximation conditions to the same least-squares solution as found by the LSTD, but without its quadratic computational complexity. GTD is online and incremental, and does not involve multiplying by products of likelihood ratios as in importance-sampling methods.
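
For concreteness, here is a minimal sketch of the GTD update described above, for linear features and with importance-sampling ratios omitted (the function name and step sizes are illustrative, not taken from the paper):

```python
import numpy as np

def gtd_step(theta, u, phi, reward, phi_next, gamma=0.99, alpha=1e-3, beta=1e-2):
    """One GTD update for linear value estimation V(s) = theta @ phi(s).

    theta : main parameter vector
    u     : auxiliary vector tracking the expected TD(0) update E[delta * phi]
    """
    # TD(0) error for the transition (phi, reward, phi_next)
    delta = reward + gamma * phi_next @ theta - phi @ theta
    # Descend the norm of the expected TD update, using u as its stochastic estimate
    theta = theta + alpha * (phi - gamma * phi_next) * (phi @ u)
    # Move the auxiliary estimate toward delta * phi
    u = u + beta * (delta * phi - u)
    return theta, u

# Example usage with random 8-dimensional features
rng = np.random.default_rng(0)
theta, u = np.zeros(8), np.zeros(8)
theta, u = gtd_step(theta, u, rng.normal(size=8), 1.0, rng.normal(size=8))
```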

"Fast Gradient-Descent Methods for Temporal-Difference Learning with Linear Function Approximation"

Paper link: https://sites.ualberta.ca/~szepesva/papers/GTD-ICML09.pdf

Sutton, Szepesvári and Maei (2009) recently introduced the first temporal-difference learning algorithm compatible with both linear function approximation and off-policy training, and whose complexity scales only linearly in the size of the function approximator. Although their gradient temporal difference (GTD) algorithm converges reliably, it can be very slow compared to conventional linear TD (on on-policy problems where TD is convergent), calling into question its practical utility. In this paper we introduce two new related algorithms with better convergence rates. The first algorithm, GTD2, is derived and proved convergent just as GTD was, but uses a different objective function and converges significantly faster (but still not as fast as conventional TD). The second new algorithm, linear TD with gradient correction, or TDC, uses the same update rule as conventional TD except for an additional term which is initially zero. In our experiments on small test problems and in a Computer Go application with a million features, the learning rate of this algorithm was comparable to that of conventional TD. This algorithm appears to extend linear TD to off-policy learning with no penalty in performance while only doubling computational requirements.
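
As an illustration of the two updates summarized in this abstract, a sketch under the same simplifications as the GTD snippet above (linear features, no importance-sampling ratios; step sizes are placeholders):

```python
import numpy as np

def gtd2_step(theta, w, phi, reward, phi_next, gamma=0.99, alpha=1e-3, beta=1e-2):
    """One GTD2 update for V(s) = theta @ phi(s); w is the secondary weight vector."""
    delta = reward + gamma * phi_next @ theta - phi @ theta
    theta = theta + alpha * (phi - gamma * phi_next) * (phi @ w)
    w = w + beta * (delta - phi @ w) * phi
    return theta, w

def tdc_step(theta, w, phi, reward, phi_next, gamma=0.99, alpha=1e-3, beta=1e-2):
    """One TDC update: the conventional TD(0) step plus a correction term
    -gamma * phi_next * (phi @ w), which vanishes while w is still zero."""
    delta = reward + gamma * phi_next @ theta - phi @ theta
    theta = theta + alpha * (delta * phi - gamma * phi_next * (phi @ w))
    w = w + beta * (delta - phi @ w) * phi
    return theta, w
```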

Talk Outline

1. Background framework and notation for reinforcement learning (RL)

2. Common methods for policy evaluation (e.g., the GTD family of algorithms)

3. Existing convergence-rate results for the GTD algorithms

4. Convergence analysis under assumptions closer to practice (the inherently non-i.i.d. data of RL, multiple step-size schedules, etc.)

5. Summary and reflections

Talk Topic

Reinforcement Learning and Policy Evaluation

Speaker Bio

汪躍 (Yue Wang) is a third-year PhD student in the Department of Mathematics at Beijing Jiaotong University, majoring in probability theory and mathematical statistics, advised by Academician Zhi-Ming Ma. His research interests include machine learning, optimization algorithms, and the design and theoretical analysis of reinforcement learning algorithms. He received his bachelor's degree from the School of Science at Beijing Jiaotong University in 2015, and is currently an intern with the Machine Learning Group at Microsoft Research Asia.

Time

8:00 PM Beijing time, Wednesday, November 8

How to Participate

Scan the QR code on the poster to add the community organizer on WeChat, and include the note "Reinforcement Learning".

If the event looks interesting, you are welcome to sign up.

