Empirically Evaluating the Application of Reinforcement Learning to the Induction of Effective and Adaptive Pedagogical Strategies

Min Chi1, Kurt VanLehn2, Diane Litman3 and Pamela Jordan4

1 Human-Sciences and Technologies Advanced Research Institute (H-STAR), Stanford University, minchi@stanford.edu
2 School of Computing and Informatics, Arizona State University, Kurt.Vanlehn@asu.edu
3 Department of Computer Science, University of Pittsburgh, litman@cs.pitt.edu
4 Department of Biomedical Informatics, University of Pittsburgh, pjordan@pitt.edu

For many forms of e-learning environments, the system's behavior can be viewed as a sequential decision process wherein, at each discrete step, the system is responsible for selecting the next action to take. Pedagogical strategies are policies used to decide which action to take when multiple actions are available. Ideally, an effective learning environment should craft and adapt its decisions to the student's knowledge level and needs. However, there is no well-established theory governing such adaptive fine-grained decisions, and it has yet to be shown that either human tutors or the computer tutors that mimic them employ effective pedagogical strategies. In this project we focused on one form of highly interactive e-learning environment, Intelligent Tutoring Systems (ITSs). In this talk, we present a Reinforcement Learning (RL) approach for inducing effective pedagogical strategies, together with empirical evaluations of the induced strategies.
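To make this framing concrete, the minimal sketch below represents a pedagogical strategy as a policy mapping the current tutorial state to one of several available actions; all names here are illustrative, not Cordillera's actual API.

```python
from typing import Callable, Dict, List

# Illustrative types: a state is a vector of student features; a policy
# maps (state, available actions) to the chosen action.
State = Dict[str, float]
Policy = Callable[[State, List[str]], str]

def apply_policy(policy: Policy, state: State, actions: List[str]) -> str:
    """One step of the sequential decision process: pick the next action."""
    return policy(state, actions)
```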

This was a three-year project that can be divided into three stages, one stage per year. In each stage, a group of students was trained on Cordillera, a natural-language ITS for physics. All three groups followed the same procedure: completing a background survey, reading a textbook, taking a pre-test, training on Cordillera, and finally taking a post-test. All three groups used the same training problems and instructional materials, but on different versions of Cordillera. The versions differed only in the pedagogical policies employed for interactive tutorial decisions.

In Stage 1, Cordillera made interactive decisions randomly, and we collected an exploratory corpus recording the consequences of each tutorial decision with real students. This student group is thus referred to as the Exploratory Group. To differentiate this version of Cordillera from the ones used in subsequent studies, it is referred to as Random-Cordillera.
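As a rough illustration, a uniform-random decision procedure of the kind Random-Cordillera followed, together with the logging needed to build an exploratory corpus, can be sketched as below. The decision types shown (e.g., elicit vs. tell) and the log format are hypothetical placeholders, not the system's actual implementation.

```python
import random

# Hypothetical decision types and their candidate actions.
ACTIONS = {"elicit_tell": ["elicit", "tell"],
           "justify_skip": ["justify", "skip"]}

def random_decision(decision_type, state, log):
    """Pick an action uniformly at random and record (state, action)."""
    action = random.choice(ACTIONS[decision_type])
    log.append({"state": dict(state), "type": decision_type, "action": action})
    return action

corpus_log = []  # accumulates the exploratory corpus
random_decision("elicit_tell", {"step": 3, "pct_correct": 0.6}, corpus_log)
```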

In Stage 2, we conducted our first round of policy induction and then empirically evaluated the RL-induced policies. More specifically, a model-based RL approach, Policy Iteration, was applied to the Exploratory corpus to induce a set of pedagogical policies. Because we dichotomized the students' Normalized Learning Gains (NLGs) into rewards of +100 and −100, the induced policies are referred to as Dichotic Gain (DichGain) policies. The DichGain policies were added to Cordillera, and this version of Cordillera was named DichGain-Cordillera. Except for following different policies (random vs. DichGain), the remaining components of Cordillera, including the GUI, the training problems, and the tutorial scripts, were left untouched. DichGain-Cordillera's effectiveness was tested by training a new group of college students. Results showed that although the DichGain policies generated significantly different patterns of tutorial decisions than the random policy, no significant overall difference was found between the two groups on the pre-test, post-test, or NLGs.
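For concreteness, the sketch below is textbook policy iteration on a finite MDP whose transition probabilities and expected rewards are assumed to have been estimated from the exploratory corpus. The state and action encodings, and the toy numbers, are illustrative only, not the project's actual model.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """P[a][s, s'] = estimated transition probabilities under action a;
    R[s] = expected immediate reward in state s."""
    n_states, n_actions = R.shape[0], len(P)
    policy = np.zeros(n_states, dtype=int)
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) V = R exactly.
        P_pi = np.array([P[policy[s]][s] for s in range(n_states)])
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R)
        # Policy improvement: act greedily with respect to V.
        Q = np.array([R + gamma * P[a] @ V for a in range(n_actions)])
        new_policy = Q.argmax(axis=0)
        if np.array_equal(new_policy, policy):
            return policy, V
        policy = new_policy

# Toy usage: 2 states, 2 actions, dichotomized +100 / -100 rewards.
P = [np.array([[0.8, 0.2], [0.1, 0.9]]),
     np.array([[0.5, 0.5], [0.6, 0.4]])]
R = np.array([-100.0, 100.0])
policy, V = policy_iteration(P, R)
```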

There are many possible explanations for the lack of difference in learning outcomes between the DichGain and Exploratory groups. We argue that the exploration of our RL approach in Stage 2 was limited. As described above, applying RL to induce effective tutorial policies is not a simple task in which a training corpus can merely be plugged into a toolkit. Rather, it depends on many factors, such as the choice of state features, the feature selection method, and the definition of the reward function; one such factor, feature selection, is sketched below.
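As a hypothetical sketch of that factor: each candidate feature subset defines a different state representation, a policy is induced for each, and the subsets are compared by a score such as the expected cumulative reward (ECR) of the induced policy. The helper names (induce_policy, ecr) are placeholders, not the project's actual code.

```python
from itertools import combinations

def select_features(corpus, features, k, induce_policy, ecr):
    """Exhaustively score every size-k feature subset; return the best."""
    best_subset, best_score = None, float("-inf")
    for subset in combinations(features, k):
        policy, value = induce_policy(corpus, subset)  # RL induction per subset
        score = ecr(value, corpus, subset)             # e.g., expected cumulative reward
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset, best_score

# Toy usage with dummy scorers, purely to show the interface.
dummy_induce = lambda corpus, subset: (None, len(subset))
dummy_ecr = lambda value, corpus, subset: sum(len(f) for f in subset)
best, score = select_features([], ["autonomy", "difficulty", "duration"], 2,
                              dummy_induce, dummy_ecr)
```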

In Stage 3, the approach to these RL-related issues was substantially revised. We directly used each student's NLG × 100 as the reward function. The induced set of tutorial policies is thus named the Normalized Gain (NormGain) tutorial policies, and the corresponding version of Cordillera was named NormGain-Cordillera. We again ran a new group of students, the NormGain group, using the same educational materials as the previous two groups.
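For reference, the sketch below computes this reward. The NLG formula shown, (posttest - pretest) / (1 - pretest) with scores normalized to [0, 1], is the standard normalized-gain definition and is stated here as an assumption about the project's exact formula.

```python
def nlg_reward(pretest, posttest):
    """Return NLG x 100, the per-student terminal reward used in Stage 3."""
    nlg = (posttest - pretest) / (1.0 - pretest)  # assumed NLG definition
    return nlg * 100.0

print(nlg_reward(0.40, 0.70))  # -> 50.0
```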

Overall, our results provide empirical evidence that, when properly applied, RL can achieve complex and important instructional goals such as improved learning, and can quantitatively and substantially improve the performance of a tutoring system. In this talk, we will describe our detailed methodology for using RL to optimize pedagogical policies based on limited interactions with human students, and then present empirical results from validating the induced policies on students.