Python In HPC

11m ago
2.52 MB
75 Pages
Last View : 4d ago
Last Download : 4d ago
Upload by : Mollie Blount

Python in HPCNIH High Performance Computing [email protected] Resch

Why python?

Because Python is simpleprint("Hello world") import thisThe Zen of Python, by Tim PetersBeautiful is better than ugly.Explicit is better than implicit.Simple is better than complex.Complex is better than complicated.

Because Python is fully featuredPython comes with a full set of basic data types, modules, errorhandling and accomodates writing code in procedural or objectoriented style. Also includes some functional elements,comprehensions, and advanced features.

Because Python is readabledef get at content(dna):"""return the AT content of a DNA string.The string must be in upper case.The AT content is returned as a float"""a count dna.count('A')t count dna.count('T')at content float(a count t count) / len(dna)return at content

Because Python is extensibleC (C API, cython, ctypes, cf )C (boost)Fortran (f2py)Rust (ctypes, cf , rust-cpython)

Because Python has many third partylibraries, tools, and a large community

Because Python is ubiquitous, portable, andfreeevery linux distributionmost (every?) clusterWindows, OS X

A quick word about Python 3 vs 2Python 3 made backwards incompatible changes (print, exceptions,division, unicode, comprehension variables, open(), .).There used to be reasons to write new code in Python 2, but they aregetting less compelling.Soon-ish support for Python 2 by the core Python team and majorpackages (e.g. numpy) will end.

But Python is slow.Right?

Well - kind ofdata from

But for a lot of tasksit doesn't really matter

How do you decide if it matters?Is my code fast enough to produce the results I need in the time Ihave?How many CPUh is this code going to waste over its lifetime?How inef cient is it?How long does it run?How often will it run?Does it cause problems on the system it's running on?How much effort is it to make it run faster?

Luckily, if it does matter for your codeit can often be made faster

Make it go faster

Pro leIt's necessary to know what sections of code are bottlenecksin order to improve performance.Measure - don't guess

Example: Mandelbrot set

def linspace(start, stop, n):step float(stop - start) / (n - 1)return [start i * step for i in range(n)]def mandel1(c, maxiter):z cfor n in range(maxiter):if abs(z) 2:return nz z*z creturn ndef mandel set1(xmin -2.0, xmax 0.5, ymin -1.25, ymax 1.25,width 1000, height 1000, maxiter 80):r linspace(xmin, xmax, width)i linspace(ymin, ymax, height)n [[0]*width for in range(height)]for x in range(width):for y in range(height):n[y][x] mandel1(complex(r[x], i[y]), maxiter)return n

Baseline timing with%timeitipython magic:In [4]: %timeit mandel set1()7.77 s 26.8 ms per loop (mean std. dev. of 7 runs, 1 loop each)Equivalently, on the command line: python -m timeit -s 'import mandel01' 'mandel01.mandel set1()'10 loops, best of 3: 7.81 sec per loopSo 8s to calculate area1 .

Pro ling with%prunipython magic:In [4]: %prun -s cumulative mandel set1()or the equivalent command line below. This, however, requires anexecutable script. python -m cProfile -s cumulative

Yields the following results:25214601 function calls in 12.622 secondsOrdered by: cumulative 2.62212.62212.60911.6482.647Most time is spent in ame:lineno(function){built-in method builtins.exec} module ) set1){built-in method builtins.abs}functionpro ling introduces some overhead (runtime of 12s vs 8s)

Pro ling results can be visualized with SnakeViz: python -m cProfile -o snakeviz --port 6542 --hostname localhost --server mandel01.profWhich starts a web server on port 6542.

Snakeviz generates an interactive visualization and sortable table:

Most the time is spent in the mandel1() function. Use thepackage to pro le this function line by line.Using%lprunipython magic:%load ext line profiler%lprun -f mandel1 mandel set1()line profiler

To do the equivalent on the command line, import the line profilerpackage and decorate the function(s) to pro le with @profile :import line [email protected] mandel1(c, maxiter):z cfor n in range(maxiter):if abs(z) 2:return nz z*z creturn n.Then, on the command line: kernprof -l -v

The line-by-line pro le returned by either method:Total time: 42.2215 sFile: mandel01.pyFunction: mandel1 at line 7Line #HitsTime Per Hit% Time Line Contents [email protected] mandel1(c, maxiter):910000004412850.41.0z c10 24463110116687830.527.6for n in range(maxiter):11 24214592165651640.739.2if abs(z) 2:127514823451960.50.8return n13 23463110130816880.631.0z z*z c142485181194310.50.3return nThere are some algorithmic improvements possible here, but let's rsttry the simplest thing we can do.

Improve sequential performance

Using thenumbajust in time (jit) compilerfrom numba import [email protected] mandel2(c, maxiter):z cfor n in range(maxiter):if abs(z) 2:return nz z*z creturn nNo changes to the code - just decorate the function in the tight loopwith a @jit results in a 7-fold speedup when calculating area1In[4]: %timeit mandel set2()1.13 s 23.7 ms per loop (mean std. dev. of 7 runs, 1 loop each)

Converting mandel set to numpy arraysNow mandel set is the bottleneck. Since it uses nested lists, numbacan't jit compile it. Can we speed it up by converting to numpy arrays?def mandel set3(xmin -2.0, xmax 0.5, ymin -1.25, ymax 1.25,width 1000, height 1000, maxiter 80):r np.linspace(xmin, xmax, width)i np.linspace(ymin, ymax, height)n np.empty((height, width), dtype int)for x in range(width):for y in range(height):n[y, x] mandel3(complex(r[x], i[y]), maxiter)return n

No - it's actually slower now on area1 .In[4]: %timeit mandel set3()1.43 s 53.9 ms per loop (mean std. dev. of 7 runs, 1 loop each)numpy arrays have some overhead that may hurt performace withsmaller array sizes. But now the function can be jit compiled withnumba by decorating it with the @jit decorator.In[4]: %timeit mandel set4()467 ms 143 μs per loop (mean std. dev. of 7 runs, 1 loop each)Now the speedup is 17-fold.

Total time: 42.2215 sFile: mandel01.pyFunction: mandel1 at line 7Line #HitsTime Per Hit% Time Line Contents [email protected] mandel1(c, maxiter):910000004412850.41.0z c10 24463110116687830.527.6for n in range(maxiter):11 24214592165651640.739.2if abs(z) 2:127514823451960.50.8return n13 23463110130816880.631.0z z*z c142485181194310.50.3return n

Algorithmic improvementBased on the de nitions of the absolute value and the square of acomplex number, we can factor out some calculations in the mandelmethod:@jitdef mandel5(creal, cimag, maxiter):real crealimag cimagfor n in range(maxiter):real2 real*realimag2 imag*imagif real2 imag2 4.0:return nimag 2 * real*imag cimagreal real2 - imag2 crealreturn n

Which gives us a respectableIn[4]: %timeit mandel set5()104 ms 173b µs per loop (mean std. dev. of 7 runs, 1 loop each)75-fold improvement over the original pure python implementation.

CythonCython is a python-like language that can be compiled to a Cextension for python. In Ipython/Jupyter notebooks using cython is assimple asIn[4]: %load ext cython

Themandelfunction in cython%%cythonimport cythonimport numpy as npcdef int mandel6(double creal, double cimag, int maxiter):cdef:double real2, imag2double real creal, imag cimagint nfor n in range(maxiter):real2 real*realimag2 imag*imagif real2 imag2 4.0:return nimag 2* real*imag cimagreal real2 - imag2 creal;return n

Themandel setfunction in (False)cpdef mandel set6(double xmin, double xmax, double ymin, double ymax,int width, int height, int maxiter):cdef:double[:] r1 np.linspace(xmin, xmax, width)double[:] r2 np.linspace(ymin, ymax, height)int[:,:] n np.empty((height, width), np.int32)int i,jfor i in range(width):for j in range(height):n[j,i] mandel6(r1[i], r2[j], maxiter)return n

In[4]: %timeit mandel set6(-2, 0.5, -1.25, 1.25, 1000, 1000, 80)103 ms 2.31 ms per loop (mean std. dev. of 7 runs, 10 loops each)So cython runs as fast as the numba version at the cost of changingcode.

GPU with pyOpenCLUsing pyopencl to implement the mandel function with an NVIDIA K80backend, the following timing was measured (for code see thenotebook on GitHub):In[4]: %timeit mandel set7(-2, 0.5, -1.25, 1.25, 1000, 1000, 80)26.9 ms 1.05 ms per loop (mean std. dev. of 7 runs, 1 loop each)A 278-fold decrease in runtime compared to the pure pythonimplementation.

FortranFortran bindings can be created with the numpy utilitydoes fortran stack up againstnumba and cython?f2py .How cat mandel8.f90subroutine mandel set8(xmin, xmax, ymin, ymax, width, height, itermax, n)real(8), intent(in):: xmin, xmax, ymin, ymaxinteger, intent(in):: width, height, itermaxinteger:: niterinteger, dimension(width, height), intent(out) :: ninteger:: x, yreal(8):: xstep, ystepxstep (xmax - xmin) / (width - 1) f2py -m mb fort -c mandel8.f90 --fcompiler gnu95.

In[4]: from mb fort import mandel8, mandel set8In[5]: %timeit mandel set8(-2, 0.5, -1.25, 1.25, 1000, 1000, 80)114 ms 3.11 ms per loop (mean std. dev. of 7 runs, 1 loop each)about 10% slower than the numba and cython implementations.

Improve sequential performance - SummaryImplementation area1 time speedup area2 time speedup1 pure python7.770s1x174.00s1x2 numba 11.130s7x11.70s15x3 numpy1.430s5x11.90s15x4 numba 20.467s17x10.80s16x5 algo0.104s75x2.67s64x8 f2py0.114s68x2.67s64x6 cython0.103s75x2.63s66x7 pyopencl0.028s228x0.06s3047x

Was that really all sequential?Numpy can be compiled against different backends. Some of them,like MKL and OpenBlas, implement implicit parallelism for someoperations. We use Anaconda python which now uses MKL, so someof the numpy code could have been implicitly parallel.import config()lapack opt info:define macros [('SCIPY MKL H', None), ('HAVE CBLAS', None)]library dirs ['/usr/local/Anaconda/envs/py3.5/lib']include dirs s ['mkl intel lp64', 'mkl intel thread', 'mkl core', 'iomp5', 'pthread'].

Parallelize - within a single machine

A word about the Python interpreterPython only allows a singly thread to execute Python bytecode at anyone time. Access to the interpreter is enforced by the GlobalInterpreter Lock (GIL).While this is sidestepped by I/O, it does prevent true parallelism withpure Python threads.However, compiled extension modules can thread and otherparadigms for parallelism have developed.

numba.vectorizeand numba.guvectorize are convenience decorators forcreating numpy ufuncs that can be single threaded, parallel, or useGPU for computation. Parallel computation uses threads.numba.vectorize

@vectorize([int32(complex64, int32)], target 'parallel')def mandel9(c, maxiter):nreal 0real 0imag 0for n in range(maxiter):nreal real*real - imag*imag c.realimag 2* real*imag c.imagreal nreal;if real * real imag * imag 4.0:return nreturn ndef mandel set9(xmin -2.0, xmax 0.5, ymin -1.25, ymax 1.25,width 1000, height 1000, maxiter 80):r1 np.linspace(xmin, xmax, width, dtype np.float32)r2 np.linspace(ymin, ymax, height, dtype np.float32)c r1 r2[:,None]*1jn mandel9(c,maxiter)return n

Implementation1 pure pythonarea2 time speedup ef ciency174.00s1xNA5 numba algo2.67s64xNA9 vectorize, 1 thread2.72s64x100%vectorize, 2 threads1.54s113x88%vectorize, 4 threads1.08s161x63%vectorize, 8 threads0.60s290x57%vectorize, 14 threads0.42s420x47%With one thread (set with NUMBA NUM THREADS ) it matches the bestsequential implementation. More threads improve upon this, butscaling isn't great.

multiprocessingMultiprocessing uses subprocesses rather than threads forparallelism to get around the GILEach child inherits the state of the parentAfter the fork data has to be shared explicitly via interprocesscommunicationSpawning child processes and sharing data have overheadThe multiprocessing API is similar to the threading API

multiprocessing - things to watch out forIf each child is also doing (implicit) threading, care has to be taken tolimit nthreads children to the number of available CPUschildDon't usemultiprocessing.cpu count()Make children ignoreSIGINT- it returns all CPUs on the nodeand parent handle it gracefullyScript's main should to be safely importable - less important on linux

multiprocessing the Mandelbrot setCreate one to many subprocesses and execute CPU boundcomputations on each independently. In this example we'll see amultiprocessing.Pool of worker processes each processing one row ofthe Mandelbrot set.

@jitdef mandel10(creal, cimag, maxiter):real crealimag cimagfor n in range(maxiter):real2 real*realimag2 imag*imagif real2 imag2 4.0:return nimag 2 * real*imag cimagreal real2 - imag2 crealreturn [email protected] mandel10 row(args):y, xmin, xmax, width, maxiter argsr np.linspace(xmin, xmax, width)res [0] * widthfor x in range(width):res[x] mandel10(r[x], y, maxiter)return res

def mandel set10(ncpus 1, xmin -2.0, xmax 0.5, ymin -1.25,ymax 1.25, width 1000, height 1000, maxiter 80):i np.linspace(ymin, ymax, height)with mp.Pool(ncpus) as pool:n row, ((a, xmin, xmax, width, maxiter) for a in i))return n

How is the performance on area2 ?Implementation1 pure pythonarea2 time speedup ef ciency174.00s1xNA2.67s64xNA10 multiproc, pool(1)3.28s53x100%multiproc, pool(2)1.90s92x86%multiproc, pool(4)1.30s134x63%multiproc, pool(8)1.08s161x37%multiproc, pool(14)1.16s150x20%5 numba algoSlighly worse than implementation 5 with 1 CPU and not scaling well.This problem is not very suited to multiprocessing.

Parallelize - across machines

MPIthe Message Passing Interface is a portable, and performant standardfor communication between processes (tasks) within and betweencompute nodes. Communication can be point-to-point or collective(broadcast, scatter, gather, .).is a Python implementation of MPI. Documentation is availableon readthedocs and at the scipy mpi4py sitempi4py

MPI - point-to-point communicationfrom mpi4py import MPIimport numpyimport timecomm MPI.COMM WORLDrank comm.Get rank()# pass data type explicitly for speed and use upper case 'Send' / 'Recv'if rank 0:data numpy.arange(100, dtype 'i')comm.Send([data, MPI.INT], dest 1)print "Rank 0 sent numpy array"if rank 1:data numpy.empty(100, dtype 'i')comm.Recv([data, MPI.INT], source 0)print "Rank 1 received numpy array"

MPI - point-to-point communication#!/bin/bash#SBATCH --ntasks 2#SBATCH --ntasks-per-core 1module load openmpi/1.10.0/gcc-4.4.7module load python/2.7mpiexec ./numpy

MPI - Mandelbrot setEach mpi task will process a consecutive chunk of rows using thefunctions jit compiled with numba.from mpi4py import MPIimport numpy as npfrom numba import jitcommsizerank MPI.COMM WORLD comm.Get size() comm.Get rank()

MPI - Mandelbrot set# how many rows to compute in this rank?N height // size (height % size rank)N np.array(N, dtype 'i') # so we can gather it later on# what slice of the whole should be computed in this rank?start i comm.scan(N) - Nstart y ymin start i * dyend y ymin (start i N - 1) * dy# calculate the local results - using numba.jit **without parallelism**Cl mandel set(xmin, xmax, start y, end y, width, N, maxiter)

MPI - Mandelbrot setrowcounts 0C Noneif rank 0:rowcounts np.empty(size, dtype 'i')C np.zeros([height, width], dtype 'i')comm.Gather(sendbuf [N, MPI.INT],recvbuf [rowcounts, MPI.INT],root 0)comm.Gatherv(sendbuf [Cl, MPI.INT],recvbuf [C, (rowcounts * width, None), MPI.INT],root 0)

MPI - Mandelbrot setTesting the MPI code on a higher resolution (32k x 32k) of area2 andcomparing it to the numba.vectorize thread parallel implementationwhich is limited to a single nodethreads/tasks threadedMPI Nodes41110s 838s18620s 341s116342s 211s432179s 105s464ND94s10

ipyparallel“ ipyparallel enables all types of parallel applications to beDocumentation - GitHub“developed, executed, debugged and monitored interactively

Spark“ Apache Spark is a fast and general engine for large-scale dataKey idea: Resilient Distributed Datasets (RDDs) are collections ofobjects across a cluster that can be computed on via paralleltransformations (map, lter, .).There are APIs in a number of languages: Python, R, Java, and Scala.Home page“processing.

Python import problemDuring startup python does a lot of small le operations as it locatesall the les it needs. These metadata heavy operations can strain thele systems if many of them happen at the same time.

Python import problem

Python import problemOne solution is to containerize the python interpreter like in theexample above.Other solutions:For mpi processes: import in the root process and share les viaMPIstatic python buildscache the shared objects / python packagesNERSC talk - NERSC paper - Python MPI bcast - Static python Scalable python

Python on Biowulf

Main python modules module load python/2.7 module load python/3.4 module load python/3.5Packages in these environments are updated regularly. To see whatpackages are installed use module load python/2.7 conda list conda list numba

Conda environments: Create your own module load Anaconda conda create -n myenv python 3.5Activate the environment and install packages with conda or pip source activate myenv which pip /.conda/envs/myenv/bin/pip conda install numpy pip install click

Conda environments: Default locationBy default, new environments will be inchanged with HOME/.conda/envs . conda config --prepend envs dirs /data/ USER/envsThis can be

Conda environments: A private Anacondainstall wget est-Linux-x86 bash Miniconda3-latest-Linux-x86 -b -p /path/to/my/conda export PATH /path/to/my/conda/bin: PATH which python/path/to/my/conda/bin/python conda install numpy.

Python science stackNumeric: numpy, scipy, pandas, statsmodel, numbaScienti c: biopython, astropy, nipy.orgVisualization: matplotlib, seaborn, ggplot, bokehMachine learning: scikit learn, tensor ow, theano, kerasTools: jupyter, ipython, conda, bioconda

More talks:HPC Python / TACCHPC Python / Blue WatersHPC Python / Exascale.orgHPC Python / PRACEHPC Python tutorial

[email protected]

A quick word about Python 3 vs 2 Python 3 made backwards incompatible changes (print, exceptions, division, unicode, comprehension variables, open(), .). There used to be reasons to write new code in Python 2, but they are getting less compelling. Soon-ish support for Python 2 by the c