Reinforcement learning

Overview

$s$: observation, $a$: action, $\theta$: parameter, $G$: cumulated reward, $\pi$: policy (network), $r$: reward

$\tau$: trajectory, $A^{\theta}(s_t, a_t)$: advantage function

$\tau = \{s_1, a_1, s_2, a_2, \dots, s_T, a_T\}$

$p_{\theta}(\tau) = p(s_1) \prod_t p_{\theta}(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$

$R(\tau) = \sum_{t=1}^{T} r_t$ (a random variable)

$\bar R_{\theta} = \sum_{\tau} R(\tau)\, p_{\theta}(\tau)$ (expected reward); the sum over all trajectories cannot be computed exactly, so it is estimated by sampling

Reward delay: sacrifice immediate reward to gain long-term reward

  • Cumulated reward
    Cumulated reward, discounted by how far each reward is from step $t$:

    $G_t = \sum_{n=t}^{T} \gamma^{n-t} r_n$  ($\gamma < 1$)
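A minimal sketch of computing this discounted cumulated reward from a list of per-step rewards (plain Python; the function name and numbers are illustrative):

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute G_t = sum_{n >= t} gamma^(n-t) * r_n for every step t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Go backwards so each G_t reuses G_{t+1}: G_t = r_t + gamma * G_{t+1}
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# A reward received three steps in the future is worth gamma^3 today:
print(discounted_returns([0.0, 0.0, 0.0, 1.0], gamma=0.9))  # ≈ [0.729, 0.81, 0.9, 1.0]
```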

  • Policy gradient

    $\nabla \bar R_{\theta} = \sum_{\tau} R(\tau)\, \nabla p_{\theta}(\tau)$; using $\nabla f(x) = f(x) \nabla \log f(x)$:

    $\nabla \bar R_{\theta} = \sum_{\tau} R(\tau)\, p_{\theta}(\tau) \dfrac{\nabla p_{\theta}(\tau)}{p_{\theta}(\tau)}$

    $\nabla \bar R_{\theta} = \sum_{\tau} R(\tau)\, p_{\theta}(\tau) \nabla \log p_{\theta}(\tau)$

    $\nabla \bar R_{\theta} \approx \dfrac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_n} R(\tau^n)\, \nabla \log p_{\theta}(a_t^n \mid s_t^n)$ (sampling $N$ trajectories $\tau^n$ from $p_{\theta}$)

    $\theta^{1} = \theta^{0} + \eta \nabla \bar R_{\theta}$

    Add a baseline $b$ so that the weight $(R(\tau) - b)$ is not always positive:

    $\nabla \bar R_{\theta} \approx \dfrac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_n} \big(R(\tau^n) - b\big)\, \nabla \log p_{\theta}(a_t^n \mid s_t^n)$

    Assign suitable credit: weight each action only by the rewards obtained after it,

    $R(\tau^n) \rightarrow \sum_{t'=t}^{T_n} \gamma^{t'-t} r_{t'}^n$, with discount factor $\gamma < 1$: the further a reward is from step $t$, the smaller its impact.

    • On-policy: the actor being trained and the actor interacting with the environment are the same

    • Off-policy: the actor being trained and the actor interacting with the environment are different

    • Proximal policy optimization (PPO): trains on data collected by an older actor while keeping the updated policy close to the one that collected the data

    Exploration: the actor needs randomness during data collection
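A minimal sketch of the sampled policy-gradient update above (REINFORCE with the baseline and credit-assignment refinements), assuming a tabular softmax policy and NumPy; all names (reinforce_gradient, episodes, theta) are illustrative:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_gradient(theta, episodes, gamma=0.99):
    """Sampled gradient with baseline:
    grad ≈ (1/N) sum_n sum_t (G_t^n - b) * grad log p_theta(a_t^n | s_t^n).
    theta:    (n_states, n_actions) logits of a tabular softmax policy
    episodes: list of trajectories [(s_t, a_t, r_t), ...] sampled with the
              current policy (on-policy)."""
    grad = np.zeros_like(theta)
    returns_per_step = []
    for ep in episodes:
        # Credit assignment: G_t = sum_{t' >= t} gamma^(t'-t) * r_{t'}
        g, gs = 0.0, []
        for (_, _, r) in reversed(ep):
            g = r + gamma * g
            gs.append(g)
        returns_per_step.append(list(reversed(gs)))
    b = np.mean([g for gs in returns_per_step for g in gs])  # baseline
    for ep, gs in zip(episodes, returns_per_step):
        for (s, a, _), g_t in zip(ep, gs):
            probs = softmax(theta[s])
            dlog = -probs            # gradient of log softmax w.r.t. the logits of state s
            dlog[a] += 1.0           # = one_hot(a) - probs
            grad[s] += (g_t - b) * dlog
    return grad / len(episodes)

# Gradient ascent step: theta^1 = theta^0 + eta * grad
# theta += 0.01 * reinforce_gradient(theta, sampled_episodes)
```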

  • Actor-critic

    Value function $V^{\theta}(s)$:

    • Monte-Carlo based approach (MC)
    • Temporal-difference approach (TD)

      $s_t, a_t, r_t, s_{t+1}$

      $V^{\theta}(s_t) = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots$

      $V^{\theta}(s_{t+1}) = r_{t+1} + \gamma r_{t+2} + \dots$

      $V^{\theta}(s_t) = \gamma V^{\theta}(s_{t+1}) + r_t$

    $A_t = G_t^{\prime} - V^{\theta}(s_t)$

    $A_t = r_t + V^{\theta}(s_{t+1}) - V^{\theta}(s_t) = Q^{\pi_{\theta}}(s_t, a_t) - V^{\theta}(s_t)$

    Shared parameters: the actor and the critic can share their first layers
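A minimal sketch of a temporal-difference critic update plus the advantage it yields, assuming a tabular value function in NumPy (illustrative names; the discount $\gamma$ is kept explicit even though the advantage formula above drops it):

```python
import numpy as np

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    """One TD step from the transition (s_t, a_t, r_t, s_{t+1}).
    Uses V(s_t) ≈ r_t + gamma * V(s_{t+1}) as the learning target and returns
    the advantage A_t = r_t + gamma * V(s_{t+1}) - V(s_t),
    an estimate of Q(s_t, a_t) - V(s_t)."""
    advantage = r + gamma * V[s_next] - V[s]
    V[s] += alpha * advantage        # move V(s_t) toward the TD target
    return advantage                 # feed A_t to the policy-gradient actor

# V = np.zeros(n_states)  # tabular critic; in practice V is a network that may share layers with the actor
```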

  • Reward shaping
    Define extra reward to guide agents

    Curiosity-based: reward the agent for seeing new but meaningful things

  • No reward
    Learn from demonstration
    • Imitation learning
    • Inverse reinforcement learning (IRL)

      $\sum_n R(\hat\tau_n) > \sum_n R(\tau_n)$ (the learned reward should rank the expert demonstrations $\hat\tau_n$ above the actor's trajectories $\tau_n$)

      Actor → Generator, Reward function → Discriminator
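A minimal sketch of the IRL idea above, assuming a linear reward over hand-made trajectory features, $R_\phi(\tau) = \phi^\top f(\tau)$; the update pushes the reward to rank expert demonstrations above the actor's trajectories (all names are illustrative):

```python
import numpy as np

def irl_reward_step(phi, expert_features, actor_features, lr=0.01):
    """One update of a linear reward R_phi(tau) = phi . f(tau): increase
    sum_n R(expert trajectory n) - sum_n R(actor trajectory n), i.e. a
    discriminator-like objective that prefers demonstrations over the actor."""
    grad = expert_features.mean(axis=0) - actor_features.mean(axis=0)
    return phi + lr * grad

# Outer loop (sketch): alternate
#   1) update phi with irl_reward_step        (reward function ~ discriminator)
#   2) train the actor with RL against R_phi  (actor ~ generator)
```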

  • Q-learning

    $\mathbb{E}[G_t^n] = Q^{\pi_{\theta}}(s_t^n, a_t^n)$
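A minimal tabular Q-learning sketch consistent with the identity above: the entry $Q(s_t, a_t)$ is nudged toward a bootstrapped sample of $\mathbb{E}[G_t]$ (NumPy, illustrative names):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One off-policy Q-learning step on the transition (s_t, a_t, r_t, s_{t+1}).
    Target: r_t + gamma * max_a' Q(s_{t+1}, a'), a sample-based estimate of E[G_t]."""
    target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])

# Q = np.zeros((n_states, n_actions))
# Acting with epsilon-greedy over Q[s] provides the exploration randomness mentioned above.
```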

Environment

Environment state: invisible to the agent

Reward
Specify the goal

$R(x) = -L(x)$

$L(x) = \sum_n A_n e_n$
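A tiny illustration of this reward specification, where the reward is the negative of a weighted sum of error terms (the weights $A_n$ and errors $e_n$ are whatever the task defines; numbers are made up):

```python
def reward(errors, weights):
    """R(x) = -L(x), with L(x) = sum_n A_n * e_n."""
    return -sum(A * e for A, e in zip(weights, errors))

print(reward(errors=[0.5, 1.0], weights=[1.0, 0.5]))  # -(0.5 + 0.5) = -1.0
```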

Agent
  • Agent state $S_t$
    A function of the history; the agent's actions depend on this state.

    $S_{t+1} = u(S_t, A_t, R_{t+1}, O_{t+1})$

    • History: full sequence of observations, actions, rewards. Used to construct agent state.

    $H_t = O_0, A_0, R_1, O_1, \dots, A_{t-1}, R_t, O_t$

    • Markov Decision Process: the state is Markov, so adding more history does not help.

      $S_t = O_t =$ environment state (the agent sees everything)

    💡
    t: time step

    u(): state update function
    • Last observation: $S_t = O_t$ (not enough)
    • Complete history: $S_t = H_t$ (too large)
    • A generic update: $S_t = u(S_{t-1}, A_{t-1}, R_t, O_t)$
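A minimal sketch of the generic update $S_t = u(S_{t-1}, A_{t-1}, R_t, O_t)$: here u simply keeps a bounded window of recent transitions, so the state stays smaller than the full history (purely illustrative; in practice u is often a learned, e.g. recurrent, function):

```python
def update_state(prev_state, prev_action, reward, observation, k=4):
    """S_t = u(S_{t-1}, A_{t-1}, R_t, O_t): append the newest transition and
    keep only the last k items, a compressed summary of the history H_t."""
    return (prev_state + [(prev_action, reward, observation)])[-k:]

# state = []                              # S_0
# state = update_state(state, a, r, o)    # S_1, S_2, ... as the episode unfolds
```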
  • Policy
    Defines the agent's behavior: a mapping $S_t \to A_t$.

    Deterministic policy: $A = \pi(S)$

    Stochastic policy: $\pi(A \mid S) = P(A \mid S)$
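A small sketch contrasting the two policy types with tabular look-ups (NumPy, illustrative names): the deterministic policy maps a state straight to an action, the stochastic one samples from $P(A \mid S)$:

```python
import numpy as np

def deterministic_policy(pi_table, s):
    """A = pi(S): one fixed action per state."""
    return pi_table[s]

def stochastic_policy(prob_table, s, rng=None):
    """A ~ pi(. | S): sample an action from the per-state distribution P(A|S)."""
    if rng is None:
        rng = np.random.default_rng()
    return rng.choice(len(prob_table[s]), p=prob_table[s])

# pi_table   = np.array([1, 0, 2])                               # action index per state
# prob_table = np.array([[0.1, 0.9], [0.5, 0.5], [0.8, 0.2]])    # rows sum to 1
```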

  • Value functions
    Used to select between actions and to evaluate how desirable a state is.

    $v_{\pi}(s) = \mathbb{E}[G_t \mid S_t = s, \pi]$

    $= \mathbb{E}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid S_t = s, \pi]$

    • Depends on policy
    • Recursive form: $G_t = R_{t+1} + \gamma G_{t+1}$
      • Bellman equation

        $v_{\pi}(s) = \mathbb{E}[R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = \pi(s)]$

        $= \mathbb{E}[R_{t+1} + \gamma v_{\pi}(S_{t+1}) \mid S_t = s, A_t = \pi(s)]$

    💡
    discount factor $\gamma \in [0, 1]$
    trades off immediate vs long-term rewards
  • Model
    Predict what the environment will do next
    • $\mathcal{P}$ predicts the next state:

      $\mathcal{P}(s, a, s') \approx p(S_{t+1} = s' \mid S_t = s, A_t = a)$

    • $\mathcal{R}$ predicts the next (immediate) reward:

      $\mathcal{R}(s, a) \approx \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$
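A small worked example tying the model and the value function together: for a tiny MDP whose model $\mathcal{P}$, $\mathcal{R}$ is known (planning), with a fixed policy already folded in, the Bellman equation $v_\pi = r_\pi + \gamma P_\pi v_\pi$ can be solved directly (all numbers are made up):

```python
import numpy as np

# Two states; P_pi[s, s'] is the transition probability under the fixed policy,
# r_pi[s] is the expected immediate reward E[R_{t+1} | S_t = s, A_t = pi(s)].
P_pi = np.array([[0.9, 0.1],
                 [0.2, 0.8]])
r_pi = np.array([0.0, 1.0])
gamma = 0.9

# Bellman equation in matrix form: v = r + gamma * P v  =>  (I - gamma * P) v = r
v_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
print(v_pi)  # state values under the fixed policy
```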

  • Categories
    • Value based: learn a value function
    • Policy based: learn a policy
    • Actor-critic: learn both a value function and a policy

    • Model-free: no model of the environment
    • Model-based: learn or use a model of the environment

    • Learning: the environment is initially unknown
    • Planning: the environment is known

    • Prediction: predict the future (evaluate a given policy)
    • Control: optimize the future (find the best policy)