Towards Automatic Machine Learning Pipeline Design

Towards Automatic Machine Learning Pipeline Design

Mitar Milutinovic

Electrical Engineering and Computer Sciences
University of California at Berkeley

Technical Report No. EECS-2019-123 (…/TechRpts/2019/EECS-2019-123.html)
August 16, 2019

Copyright 2019, by the author(s).
All rights reserved.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission.

Towards Automatic Machine Learning Pipeline Design

by

Mitar Milutinovic

A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Computer Science in the Graduate Division of the University of California, Berkeley

Committee in charge:
Professor Dawn Song, Chair
Professor Trevor Darrell
Professor Joseph Gonzalez
Professor James Holston

Summer 2019

Towards Automatic Machine Learning Pipeline Design
Copyright 2019 by Mitar Milutinovic

This work is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License. To view a copy of this license, visit https://creativecommons.org/licenses/by-sa/4.0/

Abstract

Towards Automatic Machine Learning Pipeline Design

by

Mitar Milutinovic

Doctor of Philosophy in Computer Science
University of California, Berkeley
Professor Dawn Song, Chair

The rapid increase in the amount of data collected is quickly shifting the bottleneck of making informed decisions from a lack of data to a lack of data scientists to help analyze the collected data. Moreover, the publishing rate of new potential solutions and approaches for data analysis has surpassed what a human data scientist can follow. At the same time, we observe that many tasks a data scientist performs during analysis could be automated. Automatic machine learning (AutoML) research and solutions attempt to automate portions or even the entire data analysis process.

We address two challenges in AutoML research: first, how to represent ML programs suitably for metalearning; and second, how to improve evaluations of AutoML systems to be able to compare approaches, not just predictions.

To this end, we have designed and implemented a framework for ML programs which provides all the components needed to describe ML programs in a standard way. The framework is extensible and its components are decoupled from each other; e.g., the framework can be used to describe ML programs which use neural networks. We provide reference tooling for execution of programs described in the framework. We have also designed and implemented a service, a metalearning database, that stores information about executed ML programs generated by different AutoML systems.

We evaluate our framework by measuring the computational overhead of using the framework as compared to executing ML programs which directly call underlying libraries.
We observe that the framework's ML program execution time is an order of magnitude slower and its memory usage is twice that of ML programs which do not use this framework.

We demonstrate our framework's ability to evaluate AutoML systems by comparing 10 different AutoML systems that use our framework. The results show that the framework can be used both to describe a diverse set of ML programs and to determine unambiguously which AutoML system produced the best ML programs. In many cases, the produced ML programs outperformed ML programs made by human experts.

To Andrea

Contents

List of Figures
List of Tables

1 Introduction
  1.1 Contributions

2 Related work

3 Framework for ML pipelines
  3.1 Design goals
  3.2 Syntax of pipelines
  3.3 Pipeline structure
  3.4 Primitives
  3.5 Primitive interfaces
  3.6 Hyper-parameters configuration
  3.7 Basic data types
  3.8 Data references
  3.9 Metadata
  3.10 Execution semantics
  3.11 Example pipeline
  3.12 Problem description
  3.13 Reference runtime
  3.14 Evaluating pipelines
  3.15 Metalearning

4 Pipelines in practice
  4.1 Standard pipelines
  4.2 Linear pipelines
  4.3 Reproducibility of pipelines
  4.4 Representation of neural networks
  4.5 Overhead
  4.6 Use in AutoML systems

5 Future work and conclusions
  5.1 Evaluating pipelines on raw data
  5.2 Simplistic problem description
  5.3 Data metafeatures
  5.4 Pipeline metafeatures
  5.5 Pipeline validation
  5.6 Pipeline execution optimization
  5.7 Conclusions

A Terminology
B Pipeline description
C Problem description
D Primitive metadata
E Container metadata
F Data metadata
G Semantic types
H Hyper-parameter base classes
I Pipeline run description
J Example pipeline
K Example linear pipeline
L Example neural network pipeline

Bibliography

List of Figures

1.1 Annual size of the global datasphere
1.2 Number of AI/ML preprints on arXiv published each year
3.1 Example ML program in Python programming language
3.2 Example program in a different programming style
3.3 Example hyper-parameters configuration
3.4 Visual representation of example metadata selectors
3.5 Visual representation of an example pipeline
3.6 Visual representation of the example pipeline with all hyper-parameter values
4.1 Conceptual representation of a general pipeline
4.2 Conceptual representation of a standard pipeline
4.3 Conceptual representation of a linear pipeline
4.4 Visual representation of an example linear pipeline
4.5 Visual representation of an example pipeline of a neural network
4.6 Averaged execution times of ML programs and corresponding pipelines
4.7 Results of running 10 AutoML systems on 48 datasets

List of Tables

1.1 The intensification of local shortages for data science skills
1.2 A sample of the Iris dataset
1.3 An example metalearning dataset
4.1 Run time and memory usage of example programs and pipelines

Acknowledgments

Foremost, I would like to thank my wife Andrea and my son Nemo for their utmost patience. All your hugs gave me all the energy I needed.

In no particular order, I would like to thank Ryans Zhuang, Kevin Zeng, Julia Cao, Roman Vainshtein, Rok Mihevc, Asi Messica, Charles Packer, and many other colleagues and students at UC Berkeley with whom I have worked on projects and research underpinning this work. Without you this work would not have been possible.

This work builds on many other projects and collaborations, primarily through the DARPA D3M program. I would like to thank everyone in the program, and especially those active in the working groups through which we discussed many topics present in this work. Just to name a few who have again and again stepped up to various challenges along the way: Diego Martinez, Brandon Schoenfeld, Sujen Shah, Mark Hoffmann, Alice Yepremyan, Shelly Stanley. No list would be complete without Wade Shen, who has had the vision and commitment to push the program through despite all the issues along the way. Moreover, Rob Zinkov, Atılım Güneş Baydin, and Frank Wood were pivotal in pushing me to see that what began as a simple initial straw-man proposal could be much more, and that has ultimately led to this work.

To the amazing Ray team, especially Robert Nishihara and Richard Liaw: thank you for guiding me when I got stuck. Moreover, thank you for addressing issues and feature requests quickly and efficiently; this makes your project really special.

Thank you to all who read through early drafts of this work and gave valuable feedback, especially Diego Martinez, Brandon Schoenfeld, Adrianne Zhong, Marten Lohstroh, and Andreas Mueller.

I was lucky to have not just one but three advisors: professors Dawn Song, Trevor Darrell, and Joseph Gonzalez. Thank you for all the insights, suggestions, hard questions, and gentle pushes. Each of you contributed a fundamental piece of my experience at UC Berkeley.

Of course, without my parents and their support at every step along the way, nothing would have ever been possible. Thank you.

Chapter 1

Introduction

Figure 1.1: Annual size of the global datasphere (ZB), 2010–2020. Source: IDC, November 2018, sponsored by Seagate [39].

Data available for potential use in ML programs is growing at a high rate, as shown in Figure 1.1. IDC forecasts the global datasphere to grow to 50 ZB by 2020 [39]. At the same time there is a shortage of data scientists (Table 1.1). Looking at the number of AI/ML preprints on arXiv in the cs.AI, stat.ML, and cs.NE categories published each year (Figure 1.2), we can see that it is growing dramatically and that in 2018 alone more than 12,000 preprints were published on arXiv. Those preprints can contain potential new solutions and approaches which could be used in ML programs. But it is not possible for any single individual to learn about them all, to learn for which problems and data types they are useful, to learn their effective combinations, or how they could be used in ML programs.

One way to address this challenge is by using an Automated Machine Learning (AutoML) system to help analyze data and build ML programs which use data. Given data and a

Metro Area                     July 2015   July 2018   3Y Delta
New York City, NY                 +4,132     +34,032    +29,900
San Francisco Bay Area, CA       +10,995     +31,798    +20,803
Los Angeles, CA                     +425     +12,251    +11,826
Boston, MA                        +1,667     +11,276     +9,609
Seattle, WA                       +1,182      +9,688     +8,506
Chicago, IL                       −1,826      +5,925     +7,751
Washington, D.C.                    +735      +7,686     +6,951
Dallas-Ft. Worth, TX              −2,496      +3,641     +6,137
Atlanta, GA                       −2,301      +3,350     +5,651
Austin, TX                           +26      +4,949     +4,923

Table 1.1: The intensification of local shortages for data science skills, July 2015 to July 2018. The table provides the shortage (+) or surplus (−) of people with data science skills in each metro area, and the associated delta over three years. Source: LinkedIn.

Figure 1.2: Number of AI/ML preprints on arXiv in cs.AI, stat.ML, and cs.NE categories published each year, 1994–2018. Source: arXiv, May 2019 [8].

problem description, an AutoML system automatically creates an ML program to preprocess this data and to build a machine learning model solving the problem. Ideally, the automatic process of creating an ML program should take into account all existing ML knowledge.

In the context of AutoML research, there are two challenges we tackle in this work: how to compare AutoML systems and how to better support AutoML systems which use metalearning.

Comparison of AutoML systems

There are many attempts at AutoML systems both in academia and industry (see Chapter 2), but there are many challenges in determining the quality of those AutoML systems. Quality of an AutoML system can consist of many factors, e.g.:

- How many resources does it need to run?
- How quickly does it create an ML program?
- How far is this ML program from the best ML program?
- How clean or structured does input data have to be?
- Which problem types does it support?
- How well does it search the space of possible ML programs?

Moreover, comparison of AutoML systems is hard because they use different sets of building blocks in their ML programs and use different datasets for their reported evaluation. If the building blocks are different, maybe one AutoML system simply has a better building block available and this is why its ML program outperforms an ML program made by some other AutoML system. But if both systems used the same building blocks, it might be the case that the latter AutoML system creates that same (better) ML program as well, and even faster.

Furthermore, it is hard to compare AutoML systems if the quality of ML programs themselves is not well defined or if different ML programs use different definitions of quality. Generally we care about an ML program's quality of predictions as computed by some metric, but there are also other aspects of ML programs we can care about: complexity, interpretability, generalizability, resource and data requirements, etc.

Metalearning

One big family of approaches to AutoML is centered around metalearning. Metalearning treats AutoML itself as an ML problem. Because both the ML programs created by such an AutoML system and the AutoML system itself operate on data and build an ML

model, we use the prefix meta when talking about the data and model of an AutoML system which uses metalearning: metalearning dataset or meta-dataset, and meta-model.

Table 1.2: A sample of the Iris dataset [2].

Table 1.3: An example metalearning dataset.

For metalearning, a dataset is needed which serves as input to a meta-model. If a regular tabular dataset looks like Table 1.2, a metalearning dataset looks like Table 1.3: instead of samples with attributes and targets, each sample consists, conceptually, of a dataset, a problem description, a program, and a score achieved by executing the program on the dataset and the problem description. A meta-model is trained so that, for a given new dataset and problem description, it constructs an ML program which achieves the best score. There are many challenges in metalearning, but in this work we focus on how to represent programs in such a metalearning dataset. Representation of datasets and problem descriptions we leave to future work in Sections 5.2 and 5.3.

1.1 Contributions

We address the presented challenges with the following contributions:

- We have designed and implemented a framework for ML programs which provides all components needed to describe ML programs in a standard way suitable for metalearning. The framework is extensible and its components are decoupled from each other.
- We provide reference tooling for execution of programs described in the framework.

- We present how this framework is used by 10 AutoML systems and how it addresses the challenge of comparison of AutoML systems.
- We have designed and implemented a service to serve as a metalearning dataset, storing information about ML programs executed by different AutoML systems.

The contributions empower each other: a standard way of describing ML programs enables both better comparison between AutoML systems and metalearning across ML programs created by different AutoML systems, allowing a shared representation of information about executed ML programs and construction of a metalearning dataset.
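Conceptually, a meta-dataset of the kind shown in Table 1.3 can be sketched in a few lines of Python. All dataset names, program names, and scores below are made up for illustration, and `best_program` is only a trivial stand-in for a meta-model:

```python
# Hypothetical meta-dataset: each record pairs a dataset and a problem
# description with an ML program and the score that program achieved.
meta_dataset = [
    {'dataset': 'iris', 'problem': 'classification', 'program': 'pipeline-a', 'score': 0.93},
    {'dataset': 'iris', 'problem': 'classification', 'program': 'pipeline-b', 'score': 0.87},
    {'dataset': 'sick', 'problem': 'classification', 'program': 'pipeline-a', 'score': 0.99},
]

def best_program(records, dataset, problem):
    # A trivial stand-in for a meta-model: instead of constructing a new
    # program, it just looks up the best-scoring known one.
    candidates = [r for r in records if r['dataset'] == dataset and r['problem'] == problem]
    return max(candidates, key=lambda r: r['score'])['program']

best = best_program(meta_dataset, 'iris', 'classification')
```

A real meta-model generalizes across datasets and constructs new programs, but the shape of its training data is the same: (dataset, problem, program, score) records.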

Chapter 2

Related work

The AutoML research field is active and vibrant and has produced many academic and non-academic systems [10, 14, 18, 20, 22, 23, 27, 28, 31, 32, 38, 40, 41, 42, 43, 45, 47, 50, 52], including some focusing on neural networks only [3, 9, 21, 29, 36, 53]. We can observe [12] that they take many approaches and that they are implemented in various programming languages. Those differences lead to challenges in the comparison of AutoML systems. Existing comparisons [17, 19] compare only the predictions made by those systems. While for practical purposes it is important to compare what systems can achieve as they are, this does not provide any insight into how the approaches they are taking fundamentally compare. Comparison is further complicated because different systems support different data types and task types. In this work we present a framework which enables comparison of approaches and not just predictions, across data types and problem types.

Many AutoML systems, to our knowledge at least [10, 14, 27, 40, 41, 43, 52], use some sort of metalearning. But they cannot learn from results across systems. Our framework addresses that through a shared representation of ML programs and a shared metalearning service. [49] is a similar shared service to store information about ML programs and their performance on various datasets, but stored performance scores are self-reported and ML programs are not necessarily reproducible, limiting usefulness for cross-system metalearning. Systems which focus on neural networks [3, 9, 21, 29, 36, 53] can be combined with other AutoML systems using our framework.

AutoML systems do not use one shared representation of ML programs. There are some popular pipeline languages which might be candidates for such a purpose. A scikit-learn [33] pipeline allows combining multiple scikit-learn transforms and estimators. While powerful, it inherits some weaknesses from scikit-learn itself, primarily its support for only tabular and already structured data. This prevents it from being used when inputs are raw files. Moreover, its combination of linear and nested structure can become very verbose. Common Workflow Language [46] is a standard for describing data-analysis workflows with a focus on reproducibility. But its focus is also on combining command-line programs into workflows, which is generally not what ML programs made by AutoML systems consist of. Kubeflow [24] provides a pipeline language and at the same time makes deployments on Kubernetes

simple. Similar to our framework, it allows combining components using different libraries, but every component is a Docker image, and instead of directly passing in-memory objects between components, inputs and outputs have to be serialized.

There are existing tools to describe hyper-parameters configuration [5, 13]. Our framework aims to be compatible with them while extending a static configuration with optional custom sampling logic. This allows authors to define a new type of hyper-parameter space and provide custom sampling logic without AutoML systems having to support that type of hyper-parameter space in advance.
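The idea of attaching sampling logic to the space itself can be sketched in Python. This is a hypothetical illustration, not the framework's actual API: every class and parameter name below is invented for the example.

```python
import math
import random

class Hyperparameter:
    """Hypothetical base class: a hyper-parameter space that carries
    its own sampling logic."""
    def sample(self, rng):
        raise NotImplementedError

class Uniform(Hyperparameter):
    def __init__(self, lower, upper):
        self.lower, self.upper = lower, upper

    def sample(self, rng):
        return rng.uniform(self.lower, self.upper)

class LogUniform(Hyperparameter):
    # A "new" space type: an AutoML system needs no special support for it
    # in advance, because it only ever calls sample().
    def __init__(self, lower, upper):
        self.lower, self.upper = lower, upper

    def sample(self, rng):
        return math.exp(rng.uniform(math.log(self.lower), math.log(self.upper)))

# A static configuration extended with custom sampling logic.
space = {
    'learning_rate': LogUniform(1e-4, 1e-1),
    'subsample_ratio': Uniform(0.5, 1.0),
}

rng = random.Random(0)  # seeded for reproducible sampling
config = {name: hp.sample(rng) for name, hp in space.items()}
```

Because the sampling logic lives with the space definition, an AutoML system can treat every space type uniformly through the `sample` interface.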

Chapter 3

Framework for ML pipelines

We have designed and implemented a framework for ML programs for use in AutoML systems. We provide reference tooling for execution of programs described as the framework's pipelines. We have designed and implemented a service to store and share information about executed pipelines as pipeline run descriptions. The collection of pipeline run descriptions can serve as a metalearning dataset.

In this chapter we present technical details of the framework and the related tooling and service.

3.1 Design goals

In 2019, the most popular programming language for ML programs was Python [11]. An example of such an ML program in Python for the Thyroid disease dataset [48] is available in Figure 3.1. In the example program we first select a target column and attribute columns from input data. Then we further select numerical and categorical attributes. We encode categorical attributes and we impute missing values in numerical attributes. After that we combine categorical attributes and numerical attributes back into one data structure of all attributes. We then pass this data structure, together with targets, to a classifier to fit and predict. The program runs in two passes: in the first we fit on training data and in the second pass we only predict on testing data. This example program contains general steps found in ML programs: data loading, data selection and cleaning, and finally model building. It does not contain common steps like feature extraction, construction, and selection.

If we look at such ML programs as the raw input of an AutoML system from which the system might want to learn, we can observe:

- Language-specific constructs which have nothing to do with the ML task at hand, e.g., import statements.
- Syntax of the programming language allows logically equivalent programs to be represented with different characters (changing a variable name does not change the logic

```python
import numpy
import pandas
from sklearn.preprocessing import OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier

train_dataframe = pandas.read_csv('sick_train_split.csv')
test_dataframe = pandas.read_csv('sick_test_split.csv')

encoder = OrdinalEncoder()
imputer = SimpleImputer()
classifier = RandomForestClassifier(random_state=0)

def one_pass(dataframe, is_train):
    target = dataframe.iloc[:, 30]
    attributes = dataframe.iloc[:, 1:30]
    numerical_attributes = attributes.select_dtypes(numpy.number)
    categorical_attributes = attributes.select_dtypes(numpy.object)
    categorical_attributes = categorical_attributes.fillna('')
    if is_train:
        encoder.fit(categorical_attributes)
        imputer.fit(numerical_attributes)
    categorical_attributes = encoder.transform(categorical_attributes)
    numerical_attributes = imputer.transform(numerical_attributes)
    attributes = numpy.concatenate([
        categorical_attributes,
        numerical_attributes,
    ], axis=1)
    if is_train:
        classifier.fit(attributes, target)
    return classifier.predict(attributes)

one_pass(train_dataframe, True)
predictions = one_pass(test_dataframe, False)
```

Figure 3.1: An example ML program in Python programming language.

```python
import numpy
import pandas
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier

train_dataframe = pandas.read_csv('sick_train_split.csv')
test_dataframe = pandas.read_csv('sick_test_split.csv')

train_attributes = train_dataframe.iloc[:, 1:30]
train_target = train_dataframe.iloc[:, 30]
test_attributes = test_dataframe.iloc[:, 1:30]

def get_numerical_attributes(X):
    return X.dtypes.apply(lambda d: issubclass(d.type, numpy.number))

def get_categorical_attributes(X):
    return X.dtypes == 'object'

pipeline = make_pipeline(
    make_column_transformer(
        (
            make_pipeline(
                SimpleImputer(strategy='constant', fill_value=''),
                OrdinalEncoder(),
            ),
            get_categorical_attributes,
        ),
        (SimpleImputer(), get_numerical_attributes),
    ),
    RandomForestClassifier(random_state=0),
)

pipeline.fit(train_attributes, train_target)
predictions = pipeline.predict(test_attributes)
```

Figure 3.2: An example program from Figure 3.1 in a different programming style.

of the program; adding a comment or empty lines does not either).

- Different programming styles might lead to very different programs and different ways of expressing the same underlying logic. The program in Figure 3.2 uses a different programming style, but is logically equivalent to the program in Figure 3.1.
- The programming language used is a general programming language which allows a program to do more than just solve an ML task, e.g., display a user interface, periodically save state, parallelize execution. That code is interleaved with code corresponding to the ML task.
- The programming language allows code with side effects and non-determinism. This can lead to a program not producing the same results when run multiple times on the same input data. Reproducibility of results can be achieved primarily through programming discipline.

Such properties of a programming language and programs in that language are generally reasonable, and even seen as an advantage of the programming language, when the programming language is used by humans. But in the context of AutoML systems which would consume such programs for learning and produce new programs as their outputs, all automatically, those properties can be seen as unnecessary complexity.

In this work we present a framework for ML pipelines that AutoML systems can directly consume and produce. The design goals of this framework are:

- The framework should allow most ML and data processing programs to be described as its pipelines, if not all, but be as simple as possible to facilitate both automatic generation and automatic consumption of pipelines.
- Pipelines should allow description of complete end-to-end ML programs, starting with raw files and finishing with predictions or any other ML output from models embedded in pipelines.
- The focus of the framework is machine generation and consumption as opposed to human generation and consumption. It should enable automation as much as possible.
- The framework should be extensible and the framework's components should be decoupled from each other; cf. in most programming languages a typing system and execution semantics are tightly coupled with the language itself.
- Control of side effects and randomness in pipelines, and in general full reproducibility, should be part of the framework and not an afterthought.
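The last goal can be illustrated with a small Python sketch (a hypothetical helper, not framework code): when every source of randomness is an explicit, seeded input, the program becomes reproducible by construction rather than by programming discipline.

```python
import random

def shuffle_data(data, seed):
    # All randomness flows through an explicit, seeded generator:
    # no hidden global state, no side effects on the caller's data.
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    return shuffled

# Two runs with the same seed produce identical results.
first = shuffle_data(range(10), seed=42)
second = shuffle_data(range(10), seed=42)
assert first == second
```

Recording such seeds alongside a pipeline is what lets a later run reproduce an earlier one exactly.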

3.2 Syntax of pipelines

Pipelines do not have a human-friendly syntax and are primarily represented as in-memory data structures. Many of our framework's components, including pipelines, can be represented in the JSON [6] or YAML [4] serialization formats. We provide validators using JSON Schema [16] to validate serialized data structures.

3.3 Pipeline structure

In our framework, ML programs are described as pipelines. Such pipelines consist of:

- Pipeline metadata.
- Specification of inputs and outputs of the pipeline.
- Definition of pipeline steps.

While a pipeline is an in-memory structure, we call its standard representation a pipeline description. We support the JSON and YAML serialization formats for pipeline descriptions and we provide a validator for pipeline descriptions using JSON Schema. The full list of standardized top-level fields of pipeline descriptions is available in Appendix B. Moreover, we can represent the main aspects of a pipeline structure visually. In this work we will use YAML and visual representations to present the pipeline structure.

Pipeline metadata contains mostly non-essential information about the pipeline: a human-friendly name and description, and when and how the pipeline was created. The only required metadata is a pipeline's universally unique identifier, UUID [25]. We standardize metadata as part of a pipeline description's JSON Schema.

Specification of inputs and outputs of the pipeline consists of defining the number of inputs and outputs the pipeline has, and optionally providing human-friendly names for them.

Pipeline steps define the logic of the pipeline. They are specified in order and each step defines its inputs and outputs and how the step's inputs connect to any output of any previous step or the pipeline's inputs. Connecting steps in this manner forms a DAG. There are currently three types of steps defined:

- Primitive step.
- Sub-pipeline step.
- Placeholder step.

A primitive step represents execution of a primitive. Primitives are described in Section 3.4. A sub-pipeline step represents execution of another pipeline as a step. This is similar to a function call in programming languages. A placeholder step can be used to define pipeline templates, which represent partially defined pipelines.
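As a rough illustration only — the field names below are a plausible sketch, not the normative schema from Appendix B, and both primitive names are hypothetical — a pipeline description in YAML with two primitive steps might look like:

```yaml
# Hypothetical sketch of a pipeline description; see Appendix B for the
# actual standardized top-level fields.
id: 9a1f3c2e-5b7d-4e8f-9c0a-1d2e3f4a5b6c   # required UUID
name: Example pipeline                      # optional human-friendly metadata
inputs:
  - name: dataset
outputs:
  - name: predictions
    data: steps.1.produce                   # connected to step 1's output
steps:
  - type: PRIMITIVE
    primitive:
      name: imputer                         # hypothetical primitive
    arguments:
      inputs:
        data: inputs.0                      # connected to the pipeline's input
    outputs:
      - id: produce
  - type: PRIMITIVE
    primitive:
      name: random_forest                   # hypothetical primitive
    arguments:
      inputs:
        data: steps.0.produce               # connected to the previous step
    outputs:
      - id: produce
```

Because each step references only outputs of earlier steps or the pipeline's inputs, a description of this shape always encodes a DAG.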

Pipeline metadata can contain a digest over the whole pipeline description. References to primitives and sub-pipelines can contain their expected digests as well. When a pipeline is loaded and references are de-referenced, it might happen that a different version of a primitive or a sub-pipeline is found. Those differences can be detected through mismatched digests and can help better understand why pipeline results might not be reproducible. We discuss reproducibility of pipelines in more detail in Section 4.3.

Note that the pipeline structure is defined in general terms and can be extended with other step types. Moreover, the semantics of inputs, outputs, and the connections between them are not restricted by the pipeline structure.

3.4 Primitives

Primitives are the basic building blocks of pipelines. They represent learnable functions: functions which do not necessarily have their logic defined in advance, but can learn it given example inputs and outputs. The concrete definition of the semantics of such learnable functions depends on the execution semantics used, which we will explore in Section 3.10. Moreover, primitives can be defined as regular functions as well,
