We use the following quadratic expansion of the costs, value-function and Q-function:
where
The m and v in Vm and vv etc. stand for 'matrix' and 'vector'.
The backward recursion function takes as arguments
prev_traj_distr: The policy object of the previous iteration, used get the dimensions of the new policy objecttraj_info: This object contains the dynamicseta: Lagrange dual variable; needed to compute the extended cost functiondef backward(self, prev_traj_distr, traj_info, eta):
We get the number of timesteps and the dimensions of the states and the actions from the previous policy object:
T = prev_traj_distr.T
dimU = prev_traj_distr.dU
dimX = prev_traj_distr.dX
We get the quadratic expansion of the cost function and the dynamics:
Cm_ext, cv_ext = self.compute_extended_costs(eta, traj_info, prev_traj_distr)
Fm = traj_info.dynamics.Fm
fv = traj_info.dynamics.fv
We use slice syntax so that Qm[index_x, index_u] means
index_x = slice(dimX)
index_u = slice(dimX, dimX + dimU)
We allocate space for the Value-function and initialize the new policy object:
Vm = np.zeros((T, dimX, dimX))
vv = np.zeros((T, dimX))
traj_distr = prev_traj_distr.nans_like()
We iterate over
for t in range(T - 1, -1, -1):
At each timestep, we compute the quadratic and the linear coefficients of the Q-Function:
At
Qm = Cm_ext[t, :, :]
qv = cv_ext[t, :]
if t < T - 1:
Qm += Fm[t, :, :].T.dot(Vm[t + 1, :, :]).dot(Fm[t, :, :])
qv += Fm[t, :, :].T.dot(vv[t + 1, :] + Vm[t + 1, :, :].dot(fv[t, :]))
Qm = 0.5 * (Qm + Qm.T)
Qm being not quite symmetric. To counter these numerical errors we symmetrize Qm.
Instead of directly computing
U = sp.linalg.cholesky(Qm[index_u, index_u])
L = U.T
We calculate traj_distr object:
traj_distr.K[t, :, :] = - sp.linalg.solve_triangular(
U, sp.linalg.solve_triangular(L, Qm[index_u, index_x], lower = True)
)
traj_distr.k[t, :] = - sp.linalg.solve_triangular(
U, sp.linalg.solve_triangular(L, qv[index_u], lower=True)
)
We store the covariance traj_distr object:
traj_distr.pol_covar[t, :, :] = sp.linalg.solve_triangular(
U, sp.linalg.solve_triangular(L, np.eye(dimU), lower=True)
)
traj_distr.chol_pol_covar[t, :, :] = sp.linalg.cholesky(
traj_distr.pol_covar[t, :, :]
)
traj_distr.inv_pol_covar[t, :, :] = Qm[index_u, index_u]
We calculate the quadratic and the linear coefficients of the Value-function, which are used in the next iteration:
Vm[t, :, :] = Qm[index_x, index_x] + \
Qm[index_x, index_u].dot(traj_distr.K[t, :, :])
Vm[t, :, :] = 0.5 * (Vm[t, :, :] + Vm[t, :, :].T)
vv[t, :] = qv[index_x] + Qm[index_x, index_u].dot(traj_distr.k[t, :])
After that, the loop ends.
Finally, we return the new traj_distr object:
return traj_distr