
In the PPO paper (Schulman et al., 2017), the final objective in equation 9 includes a value function loss term, which the authors describe as the mean squared error (MSE) between the predicted value and a target value.

My question is: how do you compute the $V_t^{target}$ term? I'm guessing it's the return, i.e. the sum of rewards collected from time $t$ to the end of the trajectory. Would that be discounted, like

$V_t^{target} = \sum_{i=t}^T \gamma^{(i-t)} r_i$,

or $V_t^{target} = \sum_{i=t}^T r_i$,

or neither?
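
To make the two candidates concrete, here is a minimal sketch of how I would compute each of them from the rewards of one finished trajectory. The function name `rewards_to_go` and the example rewards are my own, hypothetical choices; this only illustrates the two formulas above and is not meant to be what the paper actually does.

```python
def rewards_to_go(rewards, gamma):
    """Return G_t = sum_{i=t}^{T} gamma^(i-t) * r_i for every step t.

    Hypothetical helper for illustration only. Assumes `rewards` holds
    the rewards of a single complete trajectory, in time order.
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the return of the step after it:
    # G_t = r_t + gamma * G_{t+1}
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


rewards = [1.0, 0.0, 2.0]  # made-up rewards for a 3-step trajectory

# Candidate 1: discounted sum of rewards
discounted = rewards_to_go(rewards, gamma=0.5)

# Candidate 2: plain sum of rewards (same formula with gamma = 1)
undiscounted = rewards_to_go(rewards, gamma=1.0)

print(discounted)    # [1.5, 1.0, 2.0]
print(undiscounted)  # [3.0, 2.0, 2.0]
```

The second candidate is just the first with $\gamma = 1$, so the question comes down to whether either of these is the intended target, and if so with which $\gamma$.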