Feature Construction for Inverse Reinforcement Learning

Supplementary Videos
Sergey Levine
Stanford University
Zoran Popović
University of Washington
Vladlen Koltun
Stanford University

This webpage provides the supplementary videos for the NIPS 2010 paper "Feature Construction for Inverse Reinforcement Learning," followed by a supplementary comparison with the MMPBoost boosting-based feature construction algorithm. The paper can be viewed here.

1. Supplementary Videos

The supplementary videos below show the policies learned by FIRL, MMP, and Abbeel & Ng on the highway environment, as well as the expert policy. The summary video gives a concise overview of all of the learned policies. All policies are demonstrated on a novel stretch of highway that is distinct from the one used in training.

The videos are provided below using Flash. They are also available as high quality DivX avi files here.

Summary
Lawful - Expert
Lawful - FIRL
Lawful - Abbeel & Ng
Lawful - MMP
Outlaw - Expert
Outlaw - FIRL
Outlaw - Abbeel & Ng
Outlaw - MMP
2. Supplementary Comparison with MMPBoost

MMPBoost [1] is an inverse reinforcement learning algorithm based on Maximum Margin Planning [2] that first learns a reward function as a linear combination of provided features, and then constructs additional features by training classifiers on the existing features ("boosting"). The classifiers are trained to label states on the example paths as negative, and states on the path obtained from the optimal policy under the current loss-augmented reward function as positive. The new features therefore attempt to maximally distinguish the example policy from the optimal policy under the current reward function.
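The sketch below illustrates one round of this feature-construction step in Python, using a scikit-learn decision tree as the classifier. It is a minimal illustration, not the implementation used in the experiments: it assumes that the indices of the states visited by the example paths (example_states) and by the current loss-augmented optimal path (planned_states) have already been computed by a planner that is not shown here, and the toy data at the end merely stands in for the highway MDP's state features.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def boost_feature(features, example_states, planned_states, depth=5):
    # Train a depth-limited tree to separate states on the example paths from
    # states on the current loss-augmented optimal path, using the labeling
    # described above (example states negative, planned states positive).
    X = np.vstack([features[example_states], features[planned_states]])
    y = np.concatenate([np.zeros(len(example_states)),
                        np.ones(len(planned_states))])
    tree = DecisionTreeClassifier(max_depth=depth).fit(X, y)
    # The classifier's output over the entire state space becomes a new feature
    # that is appended to the feature set before MMP is run again.
    return tree.predict(features)

# Toy usage with random binary features standing in for the highway features.
rng = np.random.default_rng(0)
features = rng.integers(0, 2, size=(1000, 20)).astype(float)
example_states = rng.choice(1000, size=50, replace=False)
planned_states = rng.choice(1000, size=50, replace=False)
features = np.column_stack([features,
                            boost_feature(features, example_states, planned_states)])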

Unlike FIRL, MMPBoost does not perform feature selection, and therefore has no inherent mechanism for distinguishing relevant features from irrelevant ones. However, it is possible to modify the algorithm to perform rudimentary feature selection by providing it with only a subset of the features in the first iteration, while still allowing any of the features to be boosted in as part of a classifier. To make maximal use of selection, we initialize this version of MMPBoost with only one feature. When initialized with the feature that indicates a speed of 1, we refer to this method as MMPBoost (S); when initialized with the indicator for the middle lane, we denote it MMPBoost (L). The speed 1 feature is the one most often violated by the other algorithms, which erroneously drive at a speed of 1, so we expect MMPBoost (S) to yield the best results.

We ran MMPBoost and MMPBoost (S/L) on the same highway environment. As in [1], we used fixed-depth decision trees as the classifiers for constructing features. We experimented with a variety of settings, and found that representative results could be obtained with trees of depth 5, running MMPBoost for 8 iterations and MMPBoost (S/L) for 16 (more iterations are needed because the selection variants begin with fewer features). The complete table of results, including the other algorithms discussed in the paper, is presented below.
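To make the selection variants concrete, the outline below sketches the outer loop under the same assumptions as the previous snippet and reuses its boost_feature helper. The linear MMP solver and the loss-augmented planner are passed in as callables (solve_mmp and plan_loss_augmented are placeholder names), since they are not reproduced on this page, and speed_1_index and middle_lane_index are hypothetical column indices for the corresponding indicator features. The point of the sketch is only the bookkeeping: standard MMPBoost starts from all provided features, while MMPBoost (S/L) starts from a single feature but lets the boosted trees split on any raw feature.

import numpy as np

def run_mmpboost(raw_features, example_states, solve_mmp, plan_loss_augmented,
                 initial_columns=None, n_iterations=8, depth=5):
    # initial_columns=None reproduces standard MMPBoost (the reward may use all
    # provided features from the start); a single-element list such as
    # [speed_1_index] or [middle_lane_index] gives MMPBoost (S) or MMPBoost (L).
    if initial_columns is None:
        reward_features = raw_features.copy()
    else:
        reward_features = raw_features[:, initial_columns]

    for _ in range(n_iterations):
        weights = solve_mmp(reward_features)            # linear MMP on the current features
        reward = reward_features @ weights
        planned_states = plan_loss_augmented(reward)    # states on the loss-augmented optimal path
        # The boosted tree may split on *any* raw feature, including those
        # withheld from the initial reward feature set.
        new_feature = boost_feature(raw_features, example_states,
                                    planned_states, depth=depth)
        reward_features = np.column_stack([reward_features, new_feature])

    return reward_features, solve_mmp(reward_features)  # refit weights on the final feature set

In the experiments reported below, depth is 5 throughout, n_iterations is 8 for standard MMPBoost, and n_iterations is 16 for MMPBoost (S) and MMPBoost (L).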

"Lawful" policies
"Outlaw" policies
percent mispredictionfeature expectation distanceaverage speed percent mispredictionfeature expectation distanceaverage speed
Expert0.0%0.0002.4100.0%0.0002.375
FIRL22.9%0.0252.31424.2%0.0272.376
MMPBoost S24.1%0.0882.41024.8%0.0732.538
MMPBoost L25.6%0.0861.50325.6%0.0671.795
MMPBoost27.2%0.1131.06826.7%0.0901.056
MMP27.0%0.1111.06827.2%0.0961.056
A&N38.6%0.2021.05439.3%0.1641.055
Random42.7%0.2201.05341.4%0.1841.053

The results indicate that the standard version of MMPBoost does not perform significantly better than linear MMP on this problem. This is to be expected, since the number of irrelevant distractor features overwhelms the algorithm and prevents it from learning a reasonable reward function, and boosting in additional features does not remedy this problem. Videos of the learned policy confirm this: like MMP, the MMPBoost policy drives only a short distance before slowing down to speed 1:

Lawful - MMPBoost Outlaw - MMPBoost

The results for the selection versions of the algorithm are more interesting. The version initialized with the known, relevant speed feature matches or exceeds the average speed of the expert's policy, though its feature expectation distance is still far higher than that of FIRL. Unfortunately, a visual examination of the policy indicates that, although the MMPBoost (S) policy does not slow down to speed 1, it also does not capture the critical elements of the expert's behavior. The lawful policy at times speeds in the right lane, and the outlaw policy frequently speeds near police cars:

Lawful - MMPBoost (S) Outlaw - MMPBoost (S)

The results for MMPBoost (L) lie between those of MMPBoost and MMPBoost (S). Both policies still violate the key elements of the expert's behavior: the lawful policy drives at speed 4 in the right lane, while the outlaw policy often drives at speed 3 or 4 near police cars. Like MMPBoost, the lawful policy also permanently slows down to speed 1 after driving some distance, although the outlaw policy does not:

Lawful - MMPBoost (L) Outlaw - MMPBoost (L)

In conclusion, although the boosted version of MMP can construct additional features out of an existing set of features, it is not designed to perform feature selection. This prevents it from accurately reproducing expert policies in the presence of a large number of irrelevant distractor features, while FIRL still reproduces the policy accurately. Although it may be possible to construct a version of MMPBoost that also performs selection, simply withholding potentially irrelevant features from the initial feature set is insufficient to accomplish this. Intuitively, the classifiers trained by MMPBoost consider only the example path and the current optimal path, while the FIRL optimization phase considers the entire state space. This allows FIRL to consider the global implications of additional features, rather than only their effect on two particular paths. A more thorough analysis of the theoretical relationship between the two algorithms would be an interesting avenue for future work.

References

[1] N. D. Ratliff, D. Bradley, J. A. Bagnell, and J. Chestnutt. Boosting structured prediction for imitation learning. In Advances in Neural Information Processing Systems 19, 2007.

[2] N. D. Ratliff, J. A. Bagnell, and M. A. Zinkevich. Maximum margin planning. In ICML '06: Proceedings of the 23rd International Conference on Machine Learning, pages 729-736. ACM, 2006.