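#111論文等共有 (paper sharing) 395 https://t.co/BANJB8JwpI [Machine Learning '92] The original paper on REINFORCE, a well-known policy gradient method. It contains no convergence proof at this point. Judging from its references, TD methods, Q-learning, and actor-critic all predate it.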
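REINFORCE's core update, in Williams's form, is Δw = α (r - b) e, where e = ∂ ln π(a)/∂w is the "characteristic eligibility" and b is a baseline. The following is a minimal sketch, not the paper's own experiment: it assumes a softmax policy over three bandit arms, unit-variance Gaussian rewards, and a running-average baseline. The arm means, step size, and function names are hypothetical, chosen only for illustration.

```python
import math
import random


def softmax(prefs):
    """Numerically stable softmax over a list of action preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def sample(probs, rng):
    """Draw an action index according to the probabilities in `probs`."""
    u = rng.random()
    cumulative = 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if u < cumulative:
            return i
    return len(probs) - 1  # guard against floating-point round-off


def reinforce_bandit(arm_means, steps=5000, alpha=0.1, seed=0):
    """REINFORCE with a running-average baseline on a Gaussian bandit.

    Update: theta += alpha * (r - b) * grad log pi(a), where for a softmax
    policy d/d theta_k log pi(a) = 1[k == a] - pi_k.
    Returns the final action probabilities.
    """
    rng = random.Random(seed)
    theta = [0.0] * len(arm_means)
    baseline = 0.0
    for t in range(1, steps + 1):
        probs = softmax(theta)
        a = sample(probs, rng)
        r = rng.gauss(arm_means[a], 1.0)  # assumed reward model: N(mean, 1)
        advantage = r - baseline
        for k in range(len(theta)):
            grad_log = (1.0 if k == a else 0.0) - probs[k]
            theta[k] += alpha * advantage * grad_log
        baseline += (r - baseline) / t  # running mean of all rewards so far
    return softmax(theta)


if __name__ == "__main__":
    print([round(p, 3) for p in reinforce_bandit([0.0, 0.5, 1.0])])
```

Subtracting the baseline leaves the expected update unchanged and only reduces its variance, which is why b can be any quantity that does not depend on the sampled action.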
REINFARCE! Industrial-strength Monte Carlo Policy Gradient for Crime. @RichardSSutton Special dedication to Ronald J. Williams: "Simple statistical gradient-following algorithms for connectionist reinforcement learning", 1992, https://t.co/5CX8XAFpCg http
@IntuitMachine @skornblith @KordingLab @TonyZador For the saddle points vs local optima, e.g.: https://t.co/cHqxP7hUY2 For optimizing sufficient statistics of discrete variables, e.g.: https://t.co/MJb4vH3zRr
The loss function comes from Williams (1992): Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning https://t.co/7ZwR5M192H #Shef2019
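The tweet does not say which loss is meant. Assuming it is the usual score-function surrogate, -(r - b) log π(y) with (r - b) held constant, its gradient is exactly minus Williams's update direction. Williams works through Bernoulli-logistic units, where p = σ(w·x), y ~ Bernoulli(p), and the eligibility is (y - p) x. The sketch below checks two things. First, the analytic surrogate gradient matches finite differences. Second, the Monte Carlo average of (r - b)(y - p) x approaches the true gradient of expected reward, p(1 - p)(r(1) - r(0)) x. The weights, inputs, reward function, and helper names are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def surrogate_loss(w, x, y, advantage):
    """-(r - b) * log pi(y | w) for one Bernoulli-logistic unit.

    `advantage` (r - b) is a plain constant here, mirroring autodiff code
    that stops gradients through the reward.
    """
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    log_pi = math.log(p) if y == 1 else math.log(1.0 - p)
    return -advantage * log_pi


def reinforce_grad(w, x, y, advantage):
    """Analytic gradient of surrogate_loss: -(r - b) * (y - p) * x."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    return [-advantage * (y - p) * xi for xi in x]


def finite_diff_grad(w, x, y, advantage, eps=1e-6):
    """Central-difference check of the surrogate-loss gradient."""
    grad = []
    for j in range(len(w)):
        w_plus, w_minus = list(w), list(w)
        w_plus[j] += eps
        w_minus[j] -= eps
        diff = (surrogate_loss(w_plus, x, y, advantage)
                - surrogate_loss(w_minus, x, y, advantage))
        grad.append(diff / (2 * eps))
    return grad


def monte_carlo_ascent_direction(w, x, reward, baseline, n=50_000, seed=0):
    """Sample mean of (r - b)(y - p) x, an unbiased estimate of dE[r]/dw."""
    rng = random.Random(seed)
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    total = [0.0] * len(w)
    for _ in range(n):
        y = 1 if rng.random() < p else 0
        adv = reward(y) - baseline
        for j, xj in enumerate(x):
            total[j] += adv * (y - p) * xj
    return [t / n for t in total]


if __name__ == "__main__":
    w, x = [0.3, -0.2], [1.0, 2.0]
    print(reinforce_grad(w, x, 1, 0.7), finite_diff_grad(w, x, 1, 0.7))
    print(monte_carlo_ascent_direction(w, x, lambda y: float(y), 0.5))
```

Minimizing the surrogate by gradient descent therefore performs stochastic gradient ascent on expected reward, which is the property Williams established for his REINFORCE class.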
It's important to read the original sources, not just textbooks. R. J. Williams, Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning, Machine Learning, 8, 229–256 (1992) https://t.co/TYTedMpwST