5m ago

6 Views

0 Downloads

212.74 KB

10 Pages

Transcription

An Experimental and Theoretical Comparisonof Model Selection Methods Michael KearnsAT&T Bell LaboratoriesMurray Hill, New Jersey1Yishay MansouryTel Aviv UniversityTel Aviv, IsraelIntroductionIn the model selection problem, we must balance the complexity of a statistical model with its goodness of fit to thetraining data. This problem arises repeatedly in statistical estimation, machine learning, and scientific inquiry in general.Instances of the model selection problem include choosingthe best number of hidden nodes in a neural network, determining the right amount of pruning to be performed ona decision tree, and choosing the degree of a polynomial fitto a set of points. In each of these cases, the goal is not tominimize the error on the training data, but to minimize theresulting generalization error.Many model selection algorithms have been proposed in theliterature of several different research communities, too manyto productively survey here. (A more detailed history of theproblem will be given in the full paper.) Perhaps surprisingly,despite the many proposed solutions for model selection andthe diverse methods of analysis, direct comparisons betweenthe different proposals (either experimental or theoretical)are rare.The goal of this paper is to provide such a comparison,and more importantly, to describe the general conclusions towhich it has led. Relying on evidence that is divided betweencontrolled experimental results and related formal analysis,we compare three well-known model selection algorithms.We attempt to identify their relative and absolute strengthsand weaknesses, and we provide some general methods foranalyzing the behavior and performance of model selectionalgorithms. Our hope is that these results may aid the informed practitioner in making an educated choice of model This research was done while Y. Mansour, A. Ng and D. Ronwere visiting AT&T Bell Laboratories.y Supported in part by The Israel Science Foundation, administered by The Israel Academy of Science and Humanities, and by agrant of the Israeli Ministry of Science and Technology.z Supported by the Eshkol Fellowship, sponsored by the IsraeliMinistry of Science.Andrew Y. NgCarnegie Mellon UniversityPittsburgh, PennsylvaniaDana RonzHebrew UniversityJerusalem, Israelselection algorithm (perhaps based in part on some knownproperties of the model selection problem being confronted).The summary of the paper follows. In Section 2, we provide aformalization of the model selection problem. In this formalization, we isolate the problem of choosing the appropriatecomplexity for a hypothesis or model. We also introduce thespecific model selection problem that will be the basis forour experimental results, and describe an initial experimentdemonstrating that the problem is nontrivial. In Section 3, weintroduce the three model selection algorithms we examinein the experiments: Vapnik’s Guaranteed Risk Minimization(GRM) [11], an instantiation of Rissanen’s Minimum Description Length Principle (MDL) [7], and Cross Validation(CV).Section 4 describes our controlled experimental comparisonof the three algorithms. Using artificially generated data froma known target function allows us to plot complete learningcurves for the three algorithms over a wide range of samplesizes, and to directly compare the resulting generalization error to the hypothesis complexity selected by each algorithm.It also allows us to investigate the effects of varying other natural parameters of the problem, such as the amount of noise inthe data. These experiments support the following assertions:the behavior of the algorithms examined can be complex andincomparable, even on simple problems, and there are fundamental difficulties in identifying a “best” algorithm; thereis a strong connection between hypothesis complexity andgeneralization error; and it may be impossible to uniformlyimprove the performance of the algorithms by slight modifications (such as introducing constant multipliers on thecomplexity penalty terms).In Sections 5, 6 and 7 we turn our efforts to formal resultsproviding explanation and support for the experimental findings. We begin in Section 5 by upper bounding the error ofany model selection algorithm falling into a wide class (calledpenalty-based algorithms) that includes both GRM and MDL(but not cross validation). The form of this bound highlightsthe competing desires for powerful hypotheses and controlledcomplexity. In Section 6, we upper bound the additional error suffered by cross validation compared to any other modelselection algorithm. This quality of this bound depends onthe extent to which the function classes have learning curvesobeying a classical power law. Finally, in Section 7, we givean impossibility result demonstrating a fundamental handi-

cap suffered by the entire class of penalty-based algorithmsthat does not afflict cross validation. In Section 8, we weighthe evidence and find that it provides concrete arguments favoring the use of cross validation (or at least cause for cautionin using any penalty-based algorithm).2DefinitionsThroughout the paper we assume that a fixed boolean targetfunction f is used to label inputs drawn randomly according to a fixed distribution D. For any boolean functionh, we define the generalization error (h) f;D (h) Prx2D [h(x) 6 f (x)]. We use S to denote the random variable S hx1 ; b1 i; : : :; hxm ; bm i, where m is the sample size,each xi is drawn randomly and independently according toD, and bi f (xi ) ci , where the noise bit ci 2 f0; 1g is 1with probability ; we call 2 [0; 1 2) the noise rate. In thecase that 6 0, we will sometimes wish to discuss the generalization error of h with respect to the noisy examples, sowe define (h) Prx2D;c [h(x) 6 f (x) c], where c is thenoise bit. Note that (h) and (h) are related by the equality (h) (1 , ) (h) (1 , (h)) (1 , 2 ) (h) . Forsimplicity, we will use the expression “with high probability”to mean with probability 1 , over the draw of S , at a costof a factor of log(1 ) in the bounds — thus, our boundsall contain “hidden” logarithmic factors, but our handlingof confidence is entirely standard and will be spelled out inthe full paper.We assume a nested sequence of hypothesis classes (or models) F1 Fd . The target function f may ormay not be contained in any of these classes, so we definehd argminh2Fd f (h)g and opt (d) (hd ) (similarly, opt (d) (hd )). Thus, hd is the best approximation tof (with respect to D) in the class Fd , and opt (d) measuresthe quality of this approximation. Note that opt (d) is a nonincreasing function of d since the hypothesis function classesare nested. Thus, larger values of d can only improve thepotential approximative power of the hypothesis class. Ofcourse, the difficulty is to realize this potential on the basisof a small sample.With this notation, the model selection problem can be statedinformally: on the basis of a random sample S of a fixed sizem, the goal is to choose a hypothesis complexity d̃, and a hypothesis h̃ 2 Fd , such that the resulting generalization error (h̃) is minimized. In many treatments of model selection,including ours, it is explicitly or implicitly assumed that themodel selection algorithm has control only over the choiceof the complexity d , but not over the choice of the final hypothesis h̃ 2 Fd . It is assumed that there is a fixed algorithmthat chooses a set of candidate hypotheses, one from eachhypothesis class. Given this set of candidate hypotheses, themodel selection algorithm then chooses one of the candidatesas the final hypothesis.To make these ideas more precise, we define the trainingerror ˆ(h) ˆS (h) jfhxi; bi i 2 S : h(xi ) 6 bi gj m, andthe version space VS (d) VS S (d) fh 2 Fd : ˆ(h) minh0 2Fd f ˆ(h0)gg. Note that VS (d) Fd may contain morethan one function in Fd — several functions may minimizethe training error. If we are lucky, we have in our possessiona (possibly randomized) learning algorithm L that takes asinput any sample S and any complexity value d, and outputsa member h̃d of VS (d) (using some unspecified criterion tobreak ties if jVS (d)j 1). More generally, it may be thecase that finding any function in VS (d) is intractable, andthat L is simply a heuristic (such as backpropagation or ID3)that does the best job it can at finding h̃d 2 Fd with smalltraining error on input S and d. In either case, we defineh̃d L(S; d) and ˆ(d) ˆL;S (d) ˆ(h̃d ). Note that weexpect ˆ(d), like opt (d), to be a non-increasing function ofd — by going to a larger complexity, we can only reduce ourtraining error.We can now give a precise statement of the model selection problem. First of all, an instance of the model selection problem consists of a tuple (fFd g; f; D; L), wherefFdg is the hypothesis function class sequence, f is the target function, D is the input distribution, and L is the underlying learning algorithm. The model selection problemis then: Given the sample S , and the sequence of functions h̃1 L(S; 1); : : :; h̃d L(S; d); : : : determined bythe learning algorithm L, select a complexity value d suchthat h̃d minimizes the resulting generalization error. Thus, amodel selection algorithm is given both the sample S and thesequence of (increasingly complex) hypotheses derived by Lfrom S , and must choose one of these hypotheses.The current formalization suffices to motivate a key definitionand a discussion of the fundamental issues in model selection.We define (d) L;S (d) (h̃d ). Thus, (d) is a randomvariable (determined by the random variable S ) that givesthe true generalization error of the function h̃d chosen by Lfrom the class Fd . Of course, (d) is not directly accessibleto a model selection algorithm; it can only be estimated orguessed in various ways from the sample S . A simple butimportant observation is that no model selection algorithmcan achieve generalization error less than mind f (d)g. Thusthe behavior of the function (d) — especially the locationand value of its minimum — is in some sense the essentialquantity of interest in model selection.The prevailing folk wisdom in several research communitiesposits that (d) will typically have a global minimum that isnontrivial — that is, at an “intermediate” value of d awayfrom the extremes d 0 and d m. As a demonstrationof the validity of this view, and as an introduction to a particular model selection problem that we will examine in ourexperiments, we call the reader’s attention to Figure 1. Inthis model selection problem (which we shall refer to as theintervals model selection problem), the input domain is simply the real line segment [0; 1], and the hypothesis class Fd issimply the class of all boolean functions over [0; 1] in whichwe allow at most d alternations of label; thus Fd is the class ofall binary step functions with at most d 2 steps. For the experiments, the underlying learning algorithm L that we haveimplemented performs training error minimization. This is arare case where efficient minimization is possible; we havedeveloped an algorithm based on dynamic programming thatruns in linear time, thus making experiments on large samples feasible. The sample S was generated using the targetfunction in F100 that divides [0; 1] into 100 segments of equal

width 1 100 and alternating label. In Figure 1 we plot (d)(which we can calculate exactly, since we have chosen thetarget function) when S consists of m 2000 random examples (drawn from the uniform input distribution)corruptedby noise at the rate 0:2. For our current discussion itsuffices to note that (d) does indeed experience a nontrivialminimum. Not surprisingly, this minimum occurs near (butnot exactly at) the target complexity of 100.3Three Algorithms for Model SelectionThe first two model selection algorithms we consider aremembers of a general class that we shall informally refer to aspenalty-based algorithms (and shall formally define shortly).The common theme behind these algorithms is their attemptto construct an approximation to (d) solely on the basis ofthe training error ˆ(d) and the complexity d, often by tryingto “correct” ˆ(d) by the amount that it underestimates (d)through the addition of a “complexity penalty” term.In Vapnik’s Guaranteed Risk Minimization (GRM) [11], d ischosen according to the rule 1pd argmin f ˆ(d) (d m)(1 1 ˆ(d)m d)g (1)dwhere for convenience but without loss of generality we haveassumed that d is the Vapnik-Chervonenkis dimension [11,12] of the class Fd ; this assumption holds in the intervalsmodel selection problem. The origin of this rule can besummarized as follows: it has been shown [11] (ignoringlogarithmic factors) that for every d and for every h 2 Fd ,pd m ispan upper bound on j ˆ(h) , (h)jpand hence j ˆ(d) , (d)j d m. Thus, by simply adding d m to ˆ(d), weensure that the resulting sum upper bounds (d), and if weare optimistic we might further hope that the sum is in facta close approximation to (d), and that its minimization istherefore tantamount to the minimization of (d). The actualrule given in Equation (1) is slightly more complex than this,and reflects a refined boundpon j ˆ(d) , (d)j that varies fromd m for ˆ(d) close to 0 to d m otherwise.The next algorithm we consider, the Minimum DescriptionLength Principle (MDL) [5, 6, 7, 1, 4] has rather different origins than GRM. MDL is actually a broad class of algorithmswith a common information-theoretic motivation, each algorithm determined by the choice of a specific coding schemefor both functions and their training errors; this two-part codeis then used to describe the training sample S . To illustratethe method, we give a coding scheme for the intervals modelselection problem 2 . Let h be a function with exactly d alternations of label (thus, h 2 Fd ). To describe the behavior ofh on the sample S fhxi; biig, we can simply specify thed inputs where h switches value (that is, the indices i such1Vapnik’s original GRM actually multiplies the second terminside the argmin f g above by a logarithmic factor intended to guardagainst worst-case choices from VS (d). Since this factor rendersGRM uncompetitive on the ensuing experiments, we consider thismodified and quite competitive rule whose spirit is the same.2Our goal here is simply to give one reasonable instantiationof MDL. Other coding schemes are obviously possible; however,several of our formal results will hold for essentially all MDLinstantiations., that h(xi) 6 h(xi 1 )) 3 . This takes log, md bits; dividing bym to normalize, we obtain (1 m) log md H(d m) [2],where H( ) is the binary entropy function. Now given h,the labels in S can be described simply by coding the mistakes of h (that is, those indices i where h(xi ) 6 f (xi )),at a normalized cost of H( ˆ(h)). Technically, in the codingscheme just described we also need to specify the values ofd and ˆ(h) m, but the cost of these is negligible. Thus, theversion of MDL that we shall examine for the intervals modelselection problem dictates the following choice of d :d argmind fH( ˆ(d)) H(d m)g:(2)In the context of model selection, GRM and MDL can both beinterpreted as attempts to model (d) by transforming ˆ(d)and d. More formally, a model selection algorithm of theformd argmind fG( ˆ(d); d m)g(3 )shall be called a penalty-based algorithm 4 . Notice that anideal penalty-based algorithm would obey G( ˆ(d); d m) (d) (or at least G( ˆ(d); d m) and (d) would be minimizedby the same value of d).The third model selection algorithm that we examine has adifferent spirit than the penalty-based algorithms. In crossvalidation (CV) [9, 10], we use only a fraction (1 , ) ofthe examples in S to obtain the hypothesis sequence h̃1 2F1; : : :; h̃d 2 Fd ; : : : — that is, h̃d is now L(S 0 ; d), where S 0consists of the first (1 , )m examples in S . Here 2 [0; 1]is a parameter of the CV algorithm whose tuning we discussbriefly later. CV chooses d according to the ruled argmind f ˆS 00 (h̃d )g(4)where ˆS 00 (h̃d ) is the error of h̃d on S 00 , the last m examplesof S that were withheld in selecting h̃d. Notice that for CV, weexpect the quantity (d) (h̃d ) to be (perhaps considerably)larger than in the case of GRM and MDL, because now h̃dwas chosen on the basis of only (1 , )m examples ratherthan all m examples. For this reason we wish to introduce themore general notation (d) (h̃d ) to indicate the fractionof the sample withheld from training. CV settles for (d)instead of 0 (d) in order to have an independent test set withwhich to directly estimate (d).4 A Controlled Experimental ComparisonOur results begin with a comparison of the performance andproperties of the three model selection algorithms in a carefully controlled experimental setting — namely, the intervalsmodel selection problem. Among the advantages of suchcontrolled experiments, at least in comparison to empiricalresults on data of unknown origin, are our ability to exactlymeasure generalization error (since we know the target function and the distribution generating the data), and our ability3In the full paper we justify our use of the sample points todescribe h; it is quite similar to representing h using a grid ofresolution 1 p(m) for some polynomial p( ).4With appropriately modified assumptions, all of the formal results in the paper hold for the more general form G( ˆ(d); d; m),where we decouple the dependence on d and m. However, thesimpler coupled form will suffice for our purposes.

to precisely study the effects of varying parameters of the data(such as noise rate, target function complexity, and samplesize), on the performance of model selection algorithms. Theexperimental behavior we observe foreshadows a number ofimportant themes that we shall revisit in our formal results.We begin with Figure 2. To obtain this figure, a training sample was generated from the uniform input distribution andlabeled according to an intervals function over [0; 1] consisting of 100 intervals of alternating label and equal width 5 ; thesample was corrupted with noise at rate 0:2. In Figure 2,we have plotted the true generalization errors (measured withrespect to the noise-free source of examples) GRM , MDL and CV (using test fraction 0:1 for CV) of the hypothesesselected from the sequence h̃1 ; : : :; h̃d ; : : : by each the threealgorithms as a function of sample size m, which rangedfrom 1 to 3000 examples. As described in Section 2, the hypotheses h̃d were obtained by minimizing the training errorwithin each class Fd . Details of the code used to performthese experiments will be provided in the full paper.Figure 2 demonstrates the subtlety involved in comparingthe three algorithms: in particular, we see that none of thethree algorithms outperforms the others for all sample sizes.Thus we can immediately dismiss the notion that one ofthe algorithms examined can be said to be optimal for thisproblem in any standard sense. Getting into the details, wesee that there is an initial regime (for m from 1 to slightly lessthan 1000) in which MDL is the lowest of the three errors,sometimes outperforming GRM by a considerable margin.Then there is a second regime (for m about 1000 to about2500) where an interesting reversal of relative performanceoccurs, since now GRM is the lowest error, considerablyoutperforming MDL, which has temporarily leveled off. Inboth of these first two regimes, CV remains the intermediateperformer. In the third and final regime, MDL decreasesrapidly to match GRM and the slightly larger CV , and theperformance of all three algorithms remains quite similar forall larger sample sizes.Insight into the causes of Figure 2 is given by Figure 3,where for the same runs used to obtain Figure 2, we insteadplot the quantities d GRM , d MDL and d CV , the value of d̃ chosen by GRM, MDL and CV respectively (thus, the “correct”value, in the sense of simply having the same number ofintervals as the target function, is 100). Here we see thatfor small sample sizes, corresponding to the first regime discussed for Figure 2 above, d GRM is slowly approaching 100from below, reaching and remaining at the target value forabout m 1500. Although we have not shown it explicitly, GRM is incurring nonzero training error throughout theentire range of m. In comparison, for a long initial period(corresponding to the first two regimes of m), MDL is simplychoosing the shortest hypothesis that incurs no training error(and thus encodes both “legitimate” intervals and noise), andconsequently d̃MDL grows in an uncontrolled fashion. Moreprecisely, it can be shown that during this period d MDL isobeying d̃MDL d0 2 (1 , )m (1 , 2 )2 s, wheres is the number of (equally spaced) intervals in the targetfunction and is the noise rate (so for the current experiment5Similar results hold for a randomly chosen target function.s 100 and 0:2). This “overcoding” behavior of MDLis actually preferable, in terms of generalization error, to theinitial “undercoding” behavior of GRM, as verified by Figure 2. Once d GRM approaches 100, however, the overcodingof MDL is a relative liability, resulting in the second regime.Figure 3 clearly shows that the transition from the secondto the third regime (where approximate parity is achieved)is the direct result of a dramatic correction to d̃MDL fromd0 (defined above) to the target value of 100. Finally, d CVmakes a more rapid but noisier approach to 100 than d GRM ,and in fact also overshoots 100, but much less dramaticallythan d MDL . This more rapid initial increase again results insuperior generalization error compared to GRM for small m,but the inability of d̃CV to settle at 100 results in slightlyhigher error for larger m. In the full paper, we examine thesame plots of generalization error and hypothesis complexityfor different values of the noise rate; here it must suffice tosay that for 0, all three algorithms have comparable performance for all sample sizes, and as increases so do thequalitative effects discussed here for the 0:2 case (forinstance, the duration of the second regime, where MDL isvastly inferior, increases with the noise rate).The behavior d GRM and d MDL in Figure 3 can be traced to theform of the total penalty functions for the two algorithms.For instance, in Figures 4, and 5, we plot the total MDLpenalty H( ˆ(d)) H(d m) as a function of d for the fixedsample sizes m 2000 and 4000 respectively, again usingnoise rate 0:20. At m 2000, we see that the totalpenalty has its global minimum at approximately 650, whichis roughly the zero training error value d0 discussed above(we are still in the MDL overcoding regime at this samplesize; see Figures 2 and 3). However, by this sample size,a significant local minimum has developed near the targetvalue of d 100. At m 4000, this local minimumat d 100 has become the global minimum. The rapidtransition of d MDL that marks the start of the final regimeof generalization error is thus explained by the switchingof the global total penalty minimum from d0 to 100. InFigures 6, we plot the total GRM penalty, just for the samplesize m 2000. The behavior of the GRM penalty is muchmore controlled — for each sample size, the total penaltyhas a single-minimum bowl shape, with the minimum lyingto the left of d 100 for small sample sizes and graduallymoving over d 100 and sharpening there for large m; asFigure 6 shows, the minimum already lies at d 100 bym 2000, as confirmed by Figure 3.A natural question to pose after examining Figures 2 and 3is the following: is there a penalty-based algorithm that enjoys the best properties of both GRM and MDL? By thiswe would mean an algorithm that approaches the “correct”d value (whatever it may be for the problem in hand) morerapidly than GRM, but does so without suffering the long,uncontrolled “overcoding” period of MDL. An obvious candidate for such an algorithm is simply a modified version ofGRM or MDL, in which we reason (for example) that perhaps the GRM penalty for complexity is too large for thisproblem (resulting in the initial reluctance to code), and wethus multiply the complexity penalty term in the GRM rule(the second term inside the argminf g) in Equation (1) by a

constant less than 1 (or analogously, multiply the MDL complexity penalty term by a constant greater than 1 to reduceovercoding). The results of an experiment on such a modified version of GRM are shown in Figures 7 and 8, wherethe original GRM performance is compared to a modifiedversion in which the complexity penalty is multiplied by 0.5.Interestingly and perhaps unfortunately, we see that there isno free lunch: while the modified version does indeed codemore rapidly and thus reduce the small m generalization error, this comes at the cost of a subsequent overcoding regimewith a corresponding degradation in generalization error (andin fact a considerably slower return to d 100 than MDLunder the same conditions) 6 . The reverse phenomenon (reluctance to code) is experienced for MDL with an increasedcomplexity penalty multiplier (details in the full paper).Let us summarize the key points demonstrated by these experiments. First, none of the three algorithms dominates theothers for all sample sizes. Second, the two penalty-basedalgorithms seem to have a bias either towards or against coding that is overcome by the inherent properties of the dataasymptotically, but that can have a large effect on generalization error for small to moderate sample sizes. Third, this biascannot be overcome simply by adjusting the relative weightof error and complexity penalties, without reversing the biasof the resulting rule and suffering increased generalizationerror for some range of m. Fourth, while CV is not the bestof the algorithms for any value of m, it does manage to fairlyclosely track the best penalty-based algorithm for each valueof m, and considerably beats both GRM and MDL in theirregimes of weakness. We now turn our attention to our formal results, where each of these key points will be developedfurther.5A Bound on Generalization Error forPenalty-Based Algorithmssion of Fd . Let G : [0; 1] ! be a function that is continuous and increasing in both its arguments, and let G (m)denote the expected generalization error of the penalty-basedmodel selection algorithm d argmind fG( ˆ(d); d m)g ona training sample of size m. Then 7q G(m) RG(m) d m(5)where RG(m) approaches mind f opt (d)g (which is the bestgeneralization error achievable in any of the classes Fd ) asm ! 1. The rate of this approach will depend on propertiesof G.Proof: For any value of d, we have the inequality, , G ˆ(d ); d m G ˆ(d); d m :(6)because d is chosen to minimize G( ˆ(d); d mp). Using theuniform convergence bound j (h) , ˆ(h)j d m for allh 2 Fd and the fact that G( ; ) is increasing in its firstargument, we can replace the occurrenceof ˆ(d ) on the leftq to obtain a smallerhand side of Equation (6) by (d̃) , d mquantity, and we can replacep the occurrence of ˆ(d) on theright-hand side by opt (d) d m to obtain a larger quantity.This givesG (d ) , q d m;d̃ m p G opt (d) d m; d m :(7)Now because G( ; ) is an increasing function of its secondargument, we can further weaken Equation (7) to obtain q p G (d̃) , d̃ m; 0 G opt (d) d m; d m :(8)If we define G0 (x) G(x; 0), then since G( ; ) is increasingin its first argument, G0,1 ( ) is well-defined, and we may write qp (d ) G,0 1 G opt (d) d m; d m d m:(9)Now fix any small value 0. For this , let d0 be thesmallest value satisfying opt (d0 ) mind f opt (d)g —thus, d0 is sufficient complexity to almost match the ap-We begin our formal results with a bound on the generalization error for penalty-based algorithms that enjoys threefeatures. First, it is general: it applies to practically anypenalty-based algorithm, and holds for any model selectionproblem (of course, there is a price to pay for such generality,as discussed below). Second, for certain algorithms and certain problems the bound can give rapid rates of convergenceto small error. Third, the form of the bound is suggestiveof some of the behavior seen in the experimental results.We state the bound for the special but natural case in whichthe underlying learning algorithm L is training error minimization; in the full paper, we will present a straightforwardanalogue for more general L. Both this theorem and Theorem 2 in the following section are stated for the noise-freecase; but again, straightforward generalizations to the noisycase will be included in the full paper.proximative power of arbitrarily large complexity.Exam10 ) pd0 m; d0 m)) asining the behavior of G,(G( (dopt0m ! 1, we see that the arguments papproach the point,1( opt (d0 ); 0), and so G0 (G( opt (d0 ) d0 m; d0 m)) ap,100proaches G0 (G( opt (d ); 0)) opt (d ) minf opt (d)g by continuity of G( ; ), as desired. By definingTheorem 1 Let (fFd g; f; D; L) be an instance of the modelselection problem in which L performs training error minimization, and assume for convenience that d is the VC dimen-Let us now discuss the form of the bound given in Theorem 1.The first termSRG(m) approaches the optimal generalizationerror within Fd in the limit of large m, and the secondterm directly penalizes large complexity. These terms maybe thought of as competing. In order for RG (m) to approach6Similar results are obtained in experiments in which every occurrence of d in the GRM rule is replaced by an “effective dimension” c0 d for any constant c0 1.n pRG(m) minG,0 1 G opt (d) d m; d md o(10)we obtain the statement of the theorem.7We remind the reader that our bounds contain hidden logarithmic factors that we specify in the full paper.

mind f opt (d)g rapidly and not just asymptotically (that is,in order to ha

This research was done while Y. Mansour, A. Ng and D. Ron were visiting AT&T Bell Laboratories. y Supported in part by The Israel Science Foundation, adminis-tered by The Israel Academy of Science and Humanities, and by a grant of the Israeli Ministry of Science and Technology. z Supported by the Eshkol Fellowship, sponsored by the Israeli