However, it has long been known that the reinforcement principle offers at best an incomplete account of learned action
choice. Evidence from reward devaluation studies suggests that animals can also make “goal-directed” choices, putatively controlled by representations of the likely outcomes of their actions (Dickinson and Balleine, 2002). This realizes a suggestion, dating back at least to Tolman (1948), that animals are not condemned merely to repeat previously reinforced actions. From the perspective of neuroscience, habits and goal-directed action systems appear to coexist in different corticostriatal circuits. While these systems learn concurrently, they control behavior differentially under alternative circumstances (Balleine and O’Doherty, 2010, Dickinson, 1985 and Killcross and Coutureau,
2003). Computational treatments (Balleine et al., 2008, Daw et al., 2005, Doya, 1999, Niv et al., 2006 and Redish et al., 2008) interpret these as two complementary mechanisms for reinforcement learning (RL). The TD mechanism is associated with dopamine and RPEs, and is “model-free” in the sense of eschewing the representation of task structure and instead working directly by reinforcing successful actions. The goal-directed mechanism is a separate “model-based” RL system, which works by using a learned “internal model” of the task to evaluate candidate actions (e.g., by mental simulation; Hassabis and Maguire, 2007 and Schacter et al., 2007; perhaps implemented by some form of preplay; Foster and Wilson, 2006 and Johnson and Redish, 2007). Barring one recent exception (Gläscher et al., 2010, which focused on the different issue of the neural substrates of learning the internal model), previous studies investigating the neural substrates of model-free and
model-based control have not attempted to detect simultaneous correlates of both as these systems learn concurrently. Thus, the way the controllers interact is unclear, and the prevailing supposition that neural RPEs originate from a distinct model-free system remains untested. Here we exploited the difference between their two types of action evaluation to investigate the interaction of the controllers in humans quantitatively, using functional MRI (fMRI). Model-free evaluation is retrospective, chaining RPEs backward across a sequence of actions. By contrast, model-based evaluation is prospective, directly assessing available future possibilities. Thus, it is possible to distinguish the two using a sequential choice task. In theory, the choices recommended by model-based and model-free strategies depend on their own, separate valuation computations. Thus, if behavior reflects contributions from each strategy, then we can make the clear, testable prediction that neural signals reflecting either valuation should dissociate from behavior (Kable and Glimcher, 2007).
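
The contrast between retrospective and prospective evaluation can be made concrete with a small sketch. The code below is an illustration under stated assumptions, not the model fitted in this study: the two-stage task layout, the transition probabilities, the learning rate, and the eligibility-trace parameter are all hypothetical values chosen for clarity, and the function names are invented for this example. The model-free learner passes prediction errors backward to the first-stage action that was taken, whereas the model-based evaluator computes first-stage values by looking forward through an assumed transition model.

```python
# Illustrative sketch only; every number and name here is an assumption.
# First-stage action a leads to second-stage state s with
# probability TRANSITION[a][s] (a "common" and a "rare" transition).
TRANSITION = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.3, 1: 0.7}}
ALPHA = 0.5    # learning rate (assumed)
LAMBDA = 1.0   # eligibility trace: how much of the final RPE reaches stage 1 (assumed)


def model_free_trial(q_stage1, v_stage2, action, state, reward,
                     alpha=ALPHA, lam=LAMBDA):
    """Retrospective evaluation: chain prediction errors backward."""
    # RPE at the transition: value of the state reached minus the
    # value predicted for the chosen first-stage action.
    rpe1 = v_stage2[state] - q_stage1[action]
    q_stage1[action] += alpha * rpe1
    # RPE at the outcome: reward received minus the value of the state.
    rpe2 = reward - v_stage2[state]
    v_stage2[state] += alpha * rpe2
    # The outcome RPE is also credited to the first-stage action taken,
    # regardless of whether the transition was common or rare.
    q_stage1[action] += alpha * lam * rpe2
    return rpe1, rpe2


def model_based_values(v_stage2, transition=TRANSITION):
    """Prospective evaluation: expected second-stage value under the model."""
    return {a: sum(p * v_stage2[s] for s, p in probs.items())
            for a, probs in transition.items()}


if __name__ == "__main__":
    q_stage1 = {0: 0.0, 1: 0.0}
    v_stage2 = {0: 0.0, 1: 0.0}
    # Action 0 is chosen, a rare transition leads to state 1, and reward arrives.
    model_free_trial(q_stage1, v_stage2, action=0, state=1, reward=1.0)
    print("model-free stage-1 values:", q_stage1)
    print("model-based stage-1 values:", model_based_values(v_stage2))
```

In this hypothetical trial, a reward that follows a rare transition raises the model-free value of the action that was actually chosen, while the model-based computation favors the other action, because that action is the one more likely to reach the rewarded state. This is the kind of divergence that a sequential choice task can exploit to separate the two strategies.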