Parallelization of the solve phase in a task-based Cholesky solver using a sequential task flow model

We describe the parallelization of the solve phase in the sparse Cholesky solver SpLLT when using a sequential task flow model. In the context of direct methods, the solution of a sparse linear system is achieved through three main phases: the analyse, the factorization and the solve phases. In the last two phases, which involve numerical computation, the factorization corresponds to the most computationally costly phase, and it is therefore crucial to parallelize this phase in order to reduce the time-to-solution on modern architectures. As a consequence, the solve phase is often not as optimized as the factorization in state-of-the-art solvers, and opportunities for parallelism are often not exploited in this phase. However, in some applications, the time spent in the solve phase is comparable to or even greater than the time for the factorization, and the user could dramatically benefit from a faster solve routine. This is the case, for example, for a conjugate gradient (CG) solver using a block Jacobi preconditioner. The diagonal blocks are factorized once only, but their factors are used to solve subsystems at each CG iteration. In this study, we design and implement a parallel version of a task-based solve routine for an OpenMP version of the SpLLT solver. We show that we can obtain good scalability on a multicore architecture enabling a dramatic reduction of the overall time-to-solution in some applications.


Introduction
In this study, we are solving the linear system where A is a large sparse symmetric positive-definite matrix. In order to do this, we use a direct method where the solution process consists of three main steps: the analyse, the factorization and the solve phases. We use a Cholesky factorization given by where L is a sparse lower triangular matrix, and P is a permutation matrix needed to preserve the sparsity. Since L is triangular, the solution can be obtained using a forward substitution step followed by a backward substitution step so that x ¼ P T z is the solution to the system in equation (1). Note that, although the factor L has a sparse structure, it is usually denser than A because of fill-in. The purpose of the analyse phase is to determine precisely the structure of L and to find a pivot order for limiting the fill-in. The returned permutation matrix P reduces the storage for the factor as well as the number of floating-point operations for factorization and solution.
In this article, we concentrate on the steps in equations (3) and (4). Most work on sparse solvers concentrates on the factorization in equation (2) because that is the most costly step of the solution process. As a consequence, the solve phase is often not as optimized as the factorization in the state-of-the-art solvers, and opportunities for parallelism are often not exploited in this phase. However, in some applications, the time spent in the solve phase is comparable to or even greater than the time for the factorization, and the user could dramatically benefit from a faster solve routine. This is the case, for example, for a conjugate gradient (CG) solver using a block Jacobi preconditioner. The diagonal blocks are factorized once only, but their factors are used to solve subsystems at each CG iteration.
Runtime systems such as StarPU (Augonnet et al., 2011), OpenMP and PaRSEC (Bosilca et al., 2013) provide a high-level API for implementing task-based algorithms where the directed acyclic graph (DAG) of tasks can be represented using different models. The most userfriendly model is the sequential task flow (STF) supported by the vast majority of runtime systems. Alternative models include the parametrized task graph (PTG) model supported by only a few runtime systems such as PaRSEC. Although the PTG model can offer some benefits in terms of performance over the STF model, it is generally much more difficult to use in practice especially in the context of sparse algorithms where the task dependency pattern is very irregular.
When using runtime systems, the runtime system is responsible for handling the DAG on the target architecture by managing the data dependencies, data coherency and task scheduling. One main advantage of this compared to using low-level synchronization mechanisms lies in the fact that it offers better portability and maintainability of the code. Furthermore, we conducted extensive experiments on using runtime systems in previous work for a sparse Cholesky factorization. We used both the STF model  and the PTG model , in the context of the SpLLT solver, and we showed it was efficient in terms of performance when running on multicore architectures compared to the state-of-the-art solver HSL_MA87 from the Harwell Subroutine Library (HSL) (http://www.hsl.rl.ac.uk/).
In this article, we extend our approach to the solve phase in SpLLT and design DAG-based algorithms for implementing the parallel version of the forward and backward substitutions. Our implementation relies on an STF model, and the parallelization is achieved using the tasking features available in the OpenMP standard. We experimentally show that, using our DAG-based approach, we are able to obtain much better performance for computing the forward and backward substitutions on multicore architectures compared to the state-of-the-art solvers PARDISO, PaStiX and HSL_MA87.
In Section 2, we discuss the solve phase for dense systems, since we will use similar kernels in our sparse solve phase. We then describe the sparse solve algorithm in Section 3 before considering how to implement this in parallel in Section 4. We present our experimental results in Section 5 where we illustrate the scalability both with respect to the number of cores and to the number of right-hand sides (nrhs). We also compare our code with other state-of-theart codes. Finally, we present some conclusions in Section 6.

DAG-based dense forward and backward substitutions
In this section, we discuss in detail the forward and backward substitutions for the case of dense matrices. For the sake of clarity and without loss of generality, we do not include the permutation matrix in this section. When performing these operations, we first partition the lower triangular dense factor L 2 R n Â n into blocks as: where L ii 2 R nb Â nb are lower triangular matrices for 1 i < k, and L kk is also a square lower triangular matrix but with dimension n À ðk À 1Þnb nb. We first discuss the forward substitution in Section 2.1 before considering the backward substitution in Section 2.2.

Forward substitution
Consider first the forward substitution in equation (3) that we present in Algorithm 1.
Algorithm 1 shows that the forward substitution requires two computational kernels. The first kernel at line 3 corresponds to the classical triangular solve, that is referred to hereafter as the solve kernel, denoted by dtrsv in the BLAS (Blackford et al., 2002). We note that the solve kernel is different from the above-mentioned solve phase and is indeed just part of this phase. The second kernel at line 5 is the update of the right-hand side by performing a matrixvector product, which is referred to hereafter as the update kernel, dgemv in BLAS parlance.
From Algorithm 1, we note that the update follows the solution of part of the right-hand side and triggers the solution of another part of b. That is, there is a dependency between the update and the solve. We can generate a DAG to show the dependencies in Algorithm 1. To illustrate this, we consider the case k ¼ 4 and present the associated DAG in Figure 1 The DAG starts by solving the equation L 11 y 1 ¼ b 1 , represented by the top node S. Once, y 1 is computed, an update of b can be performed. The updates to b 2 , b 3 and b 4 correspond to different parts of b, and they can all be performed in any order. The update of b 3 using y 1 and the Algorithm 1. forwardSolve (L, b, nb) update of b 3 using y 2 have to be completed before solving L 33 y 3 ¼ b 3 . However, as the destination space is the same for both updates, these two operations cannot be performed at the same time. We see from the DAG in Figure 1(b) that some parallelism can be exploited in the updates to the right-hand side.

Backward substitution
Once the forward substitution has returned the vector y, the solve phase performs a backward substitution to solve equation (4). In this equation, we use the transpose of L to compute the solution vector x. However, this substitution does not really differ from the previous one, only the loop order is changed, as shown in Algorithm 2. Similar to the forward substitution, this algorithm requires two computational kernels, the solve kernel and the update kernel, which give the same DAG as shown in Figure 1

DAG-based solve in sparse case
In this section, we extend the algorithms presented in Section 2 to the sparse case. We now consider A as a sparse matrix of dimension n Â n. The first step for solving equation (1) in SpLLT is the analysis of the pattern of the matrix A in order to reduce the fill-in in L during the factorization. This approach to solving sparse systems is welldocumented, and we refer the reader to Duff et al. (2017) for the details. This step can use an ordering algorithm such as AMD (Amestoy et al., 1996) or Metis (Karypis and Kumar, 1998). The analysis then generates a tree representation for the factorization process. We first present the internal representation of the L factor in SpLLT and then discuss the design and implementation of the forward and backward substitutions.

Internal representation of the L factor
At the root node of the tree, that part of the L factor is held as a dense lower triangular matrix, and we will partition it as in equation (5). For the other tree nodes, the L factor is stored as a dense trapezoidal matrix,L, that can be written as a square lower triangular matrixL 1 stacked over a rectangular matrixL 2 , namelỹ In the parlance of tree-based factorizations, the blockL 1 corresponds to variables that are fully summed, that is to say variables that are appearing for the last time and do not appear at any ancestor nodes in the tree, so the corresponding unknowns can be solved as in Algorithm 1. The entries inL 2 are then used to perform updates to the variables that will later be fully summed at the parent or ancestor nodes of the tree. We can use these dependencies to generate a DAG for the whole solve phase that allows us to use inter-level parallelism. That is, a task at a node can be processed before some of the tasks for its children as long as the task has all its dependencies satisfied. We emphasize in passing two properties of the tree. Firstly, the rows within a block iñ L 2 may match rows of more than one block in the parent. Secondly, one row in the fully summed part of a node can be shared with more than one child node.
We illustrate this in Figure 2, where the factor L consists of three nodes. The parts of the DAG associated with the leaves of the tree are very similar to the DAG shown in Figure 1. We label the blocks from the leaves to the root. If Algorithm 2. backwardSolve (L, y, nb) there are rows in common between blocks blk 7 and blk 15 then there will be a dependency between y 2 and y 7 . Moreover, we observe that there is at least one row present in all three nodes, so that the solve using the corresponding block in the parent node requires the updates from both the children. Note that the first solve task in the root node has no dependency with its children. The resulting internode parallelism means that the solve of the corresponding part of the solution vector, y 5 , can be executed before the tasks at the child nodes complete.
For efficient implementation of the forward substitution, we must modify Algorithm 1 to take account of three elements. Firstly, the L factor is spread over the nodes of the tree. Secondly, the shape of theL associated with each node of the tree is trapezoidal, except for the root. Thirdly,L 2 is composed of rows that have discontinuous indices in L so that indirect addressing is required.

Dense kernels and data movement
Our construction of the tree representation of the factor L is such that the rows of a node may not correspond to contiguous entries in b. However, the computational kernels do not handle indirect addressing. This leads to data movement between the global vector and a local vector so that the data are contiguous in the local vector. Thus, a local vector is associated with each node of the tree and is split into two parts: the part associated withL 1 consists of a contiguous portion of the global solution vector, whereas the second part is a workspace. One way in which we avoid indirect addressing is to use the right-hand side vector permuted according to the pivot ordering. That is to say that, for the forward substitution, we work from the vector Pb in equation (3). Thus, the part of the vector corresponding to the triangular blockL 1 will be contiguous in the global vector and so we can work directly on this when solving for the fully summed variables. We believe this to be a novel feature for the sparse solve phase. We do not actually permute the right-hand side prior to the forward substitution but only when the entries of the right-hand side are needed by the solve kernel.
3.2.1. Forward solve and forward update kernels. We illustrate the two parts ofL and the movement or equivalence of global and local vectors in Figure 3. As mentioned earlier, the part ofỹ corresponding toL 1 is identical to the appropriate part of the global vector so no indirect addressing is required. The remaining part ofỹ is a workspace local to the node. This workspace is related to the global vector through indirect addressing corresponding to row indices in the blocks ofL 2 . The first part of the local vector is set with the appropriate entries of Pb. In addition, both parts of the local vector may require a gather operation with workspaces of the children of the node. That is, when a row of L is present in a child of the node and the associated block in the child is processed, there is an update of the local vector ofL using some entries from the child workspace.
In Figure 3, we show in the dark shaded areas the solution for the fully summed variables that involve the blockL 1 with the blocks y i and y i þ 1 computed in-place. SinceL 2 is composed of rows whose indices in y are discontinuous, indirect addressing is required in order to map these components into the contiguous local vector. These are then used for computing the matrix vector productL 3jỹj , with j ¼ 1; 2, and the resulting vector is subsequently scattered directly into the parent local vector. This is also a novel feature of the code.
3.2.2. Backward solve and backward update kernels. The backward substitution shares the same data locality issues as for the forward substitution. Consider the same node as in Figure 3 with its associated dense lower trapezoidal matrix L. In order to update the solution vector z ¼ Px from equation (4) withL T 2 , the backward update kernel has first to gather the corresponding components of z into a local vector, denoted byz in Figure 4. Then, the update of z is performed by the computation of ðL 31 Þ Tz 3 and ðL 32 Þ Tz 3 . Each triangular solve computes part of Px. That is, the result of the triangular solve is then permuted to obtain the solution vector x.

Sequential sparse forward solve algorithm
Moving from the dense forward solve in Algorithm 1 to the sparse case requires several changes. Firstly, as shown in Figure 2, the L factor is now a tree, where the nodes are associated with trapezoidal matrices, and has internode dependencies. Secondly, as shown in Figures 3 and 4, pro-cessingL 2 leads to indirect addressing. We present the sequential sparse solve in Algorithm 3. This algorithm takes as input the same parameters as Algorithm 1, except that the representation of the factor L is a list of nodes of the tree. The forward substitution operates over the nodes from the leaves to the root of the tree. For each node, we have the associated dense matrixL, as written in equation (6), and a local vectorỹ. We start by forming the part ofỹ associated withL 1 . That is, we gather the corresponding components of Pb with the contributions of the children of the node. This operation overwrites the content so that we avoid resetting the local vector. Then,L 1 is processed as in Algorithm 1. An additional loop is introduced in Algorithm 3 to processL 2 . This step is done so that the result of the first update of eachỹ j , with j 2 i þ 1; . . . ; l f g , overwrites its content. Finally, we sum the contribution from the children to the local vector.
The backward substitution algorithm is very similar but, in this case, we use the L factor starting with the root node and working down the tree to the leaf nodes.

Parallel implementation of the solve phase
As mentioned in Section 1, the parallel implementation of our task-based algorithms, introduced Section 3, is achieved using a runtime system which is responsible for executing the DAG in parallel on the target architecture. This involves enforcing data dependencies to ensure the correctness of the execution and managing the task scheduling for efficiently exploiting the resources. This design strategy has the advantage of producing a code that is portable and easy to maintain.
For our implementation, we decide to rely on an STF model for its greater usability than the PTG model in the context of complicated algorithms such as sparse forward and backward substitutions. The difficulties for exploiting a PTG model compared to the STF model were exposed in previous work for the parallelization of the factorization phase in SpLLT . The simplicity of the STF model lies in the fact that the parallel code is very similar to the sequential one. In this model, the tasks are submitted to the runtime system following the sequential algorithm and for each task, the manipulated data must be associated with a data access such as Read (R), Write (W) and Read-Write (RW) depending on the operations performed on the data. Using the task submission order and data access information, the runtime system is capable of building the DAG such that the sequential consistency is guaranteed during the parallel execution. Finally, we choose to use the OpenMP standard for our implementation, because we observed in previous work that it comes with a slightly lower overhead for the task management than a runtime system with many extra features such as StarPU .
In the following, we first show how the data dependency is determined using the representation of the factor L, and then, we describe the task management with OpenMP.

Data dependencies of a task
From the DAG as presented in Figure 2(b), we associate a task with a computational kernel that takes as input a block within a node of the tree and the related part of the input and output vector. The sparse aspect requires an additional task that handles the indirect addressing. This task forms the local vector by adding the contributions of the children. The submission of the tasks related to the forward substitution is presented in Listing 1. The computational operations within a node in Algorithm 3 are now considered as tasks that are submitted to the runtime system. We provide in addition the access mode to the data in each task to enable the runtime system to detect the dependencies between tasks.
The data accessed local to a node, in lines 9, 11, 16 and 20, is similar to the dense case in Algorithm 1 and is independent of the pattern of the matrix. However, the sparse aspect requires gathering entries from a set of local vectors that share rows with the local vector,ỹ i , of the node. We Listing 1. Task-based implementation of Algorithm 3. Algorithm 3. forwardSolve (L, b, nb) refer, hereafter, to this set as the data dependencies forỹ i , denoted by dep i . Once the data dependencies are computed in line 4, a task is submitted in line 5 that overwritesỹ i with the appropriate part of Pb added to the contributions coming from the vectors in dep i . On the other hand, in line 26, the submitted task modifiesỹ i by gathering the contributions coming from the vectors in dep i obtained in line 25.
We show how dep i is computed in Algorithm 4. As in lines 4 and 25 of Listing 1, the initial parameters are the row indices ofỹ i and the node that owns it. This algorithm creates dep by visiting the children of the node in line 3. For each child node of node, the part of the local vector that shares at least one row index withỹ i is considered as a data dependency. Note that, in practice, we compute this list once before the solve, and we use the list during the forward and backward substitutions. This saves computations when the solve phase is done multiple times.
The submission of the tasks for the backward substitution is similar to the forward substitution, except for the computations of the data dependencies, where the role of child nodes is replaced by the ancestor nodes.

Management of the tasks
Our implementation uses the runtime system OpenMP, Release 4.5. This release offers a runtime system that schedules the execution of tasks with dependencies. There are several ways to describe the dependencies. One simple way is to provide the address of the data that are being accessed through a read, write or read/write mode as in Listing 1. Thus, the dependencies have to be known at compilation time by OpenMP. However, the set dep returned by Algorithm 4 has a variable length and its length cannot be predicted prior to the execution of the code.
To handle a variable length of dependencies in OpenMP, we use a k-ary tree combined with synchronization tasks to control the scheduling of the task. A synchronization task is a task submitted to the runtime system that does nothing, like a no-op instruction, except to release a dependency for further use. The purpose of using a k-ary tree is to reduce the number of tasks from N ¼ jdepj to c, an acceptable number of dependencies, given that we have to statically define each possibility in our code. To do this, we define a chunk, that is, a subset of unsatisfied dependencies, equal to c. The k-ary tree is processed as follows. At the first level of the k-ary tree, the dependencies in dep are considered to be unsatisfied and are split into dN =ce chunks. For each chunk, a synchronization task is submitted to the runtime system. This synchronization task is scheduled when all dependencies in the chunk are satisfied. It then releases the first dependency of the chunk for the next level by flagging it as unsatisfied. Thus, the k-ary reduction leads to at most c unsatisfied dependencies, as presented in Figure 5. In this example, the size of the chunk is 5, and N ¼ 35. A white square represents an unsatisfied dependency, and a grey rectangle represents a set of satisfied dependencies. The first dependency of each chunk is still unsatisfied between two levels. A task is finally submitted to the runtime system with a list of dependencies composed of the remaining unsatisfied dependencies of dep.
Along with this k-ary tree, we use a case statement over the number of unsatisfied dependencies in dep, where each case corresponds to the submission of a (synchronization) task having i dependencies (0 i c), where c is 10 by default.

The pruning strategy
It is worth noting that the matrix sizes in the tree can vary significantly between the nodes of the tree. More specifically, the matrices are generally small near the leaf nodes and get bigger as we traverse the tree toward the root node. As a consequence, there is plenty of inter-level parallelism at the bottom of the tree with limited opportunity for intranode parallelism. On the other hand, there is much less inter-level parallelism near the top of the tree, where there is much more intra-node parallelism. In our task-based algorithm, this would result in the creation of many small granularity tasks at the bottom of the tree which can be costly for the runtime system to manage compared to the actual workload being performed in these tasks. In addition, inter-level parallelism at the bottom of the tree can greatly exceed the amount of available resources. Therefore, a large number of tasks is generated at this level, increasing the cost for the task scheduling without bringing much benefit in terms of performance. In order to alleviate  Figure 5. Content of the set dep at each level of the k-ary tree, with a chunk size c ¼ 5. Each white square represents a dependency in dep, and each grey rectangle is composed of dependencies already satisfied. these problems, we employ a tree pruning strategy, largely inspired by the one used in the qr_mumps solver and described in the work of Buttari (2012). This allows us to dramatically reduce the number of small granularity tasks in the DAG. This pruning strategy aims at identifying a set of subtrees which are processed as a single task meaning that each subtree is processed in serial mode.
In more detail, the pruning algorithm performs a topdown traversal of the tree and selects a sufficiently large set of independent nodes in the tree so that the computational weight of the subtrees rooted at each node in this set is balanced. In practice, we force the number of subtrees to be equal to at least twice the number of resources, because we want to generate enough parallelism to compensate for potential load imbalance during the execution of the subtree within the whole DAG. One potential source of imbalance is that the metric used for estimating the computational weight for each subtree, the number of floating-point operations, might not accurately reflect the execution time.
This pruning strategy allows us to dramatically reduce the number of tasks submitted to the runtime system as well as the number of dependencies in the DAG. Moreover, processing the subtrees as a single task improves the exploitation of data-locality which increases the kernel efficiency.

Experimental results
In order to assess the performance of the solve phase of our SpLLT code, we ran tests on a set of 37 symmetric definite positive matrices, presented in Table 1. These matrices come from the Suitesparse matrix collection Davis and Hu (2011) and are those used in the work of  and Hogg et al. (2010). They correspond to a wide range of different applications including computational Gas reservoir a n is the matrix order, nz (A) represents the number of entries in the matrix A, nz (L) represents the number of entries in the factor L and Flops corresponds to the operation count for the matrix factorization.
fluid dynamics (CFD), circuit simulation, finite element modelling (FEM) and gas reservoir modelling. The dimensions of the matrices range from 36 Â 10 3 to 1:5 Â 10 6 . We focus our experiments on parallel efficiency both when nrhs increases and when the number of workers (nworker) increases. We also compare the performance of the SpLLT solve phase with the solve phase of other sparse direct solvers. Note that in the following tables, fastest time are in bold. We perform the tests on a multicore machine which has two Intel Xeon E5-2695, v3 CPUs, that is 2 NUMA nodes composed of 14 cores each, with a total 128 GB of memory. Each core has a theoretical peak of 36.8 GFlop/s for a frequency of 2.3 GHz, so the peak of the multicore node is 1.03 TFlop/s in double precision. The code is compiled with GNU 6.2 and MKL 17.0.2 and linked to Metis 5.1.0 and SPRAL (https://github.com/ralna/spral/). In most cases, we display results on a representative subset of matrices. We have done preliminary tests to study the reproducibility of the execution times of the machine by solving equation (1) 10 times and increasing nworker. Figure 6 shows the range in the times for the solve phase on four of our test problems by horizontal bars. We note that this range can be noticeable when there are few workers, but this variability decreases markedly when nworker increases.
We study the impact of pruning in Section 5.1, of synchronization tasks in Section 5.2 and of block size in Section 5.3. In Sections 5.4 and 5.5, we then compare the solve phase of SpLLT with Intel MKL PARDISO 17.0.2 (Schenk and Gärtner, 2002) and PaStiX (Hénon et al., 2002) by increasing nrhs in Section 5.4 and nworker in Section 5.5.

Impact of pruning on the performance
In this section, we illustrate the impact of using the tree pruning algorithm discussed in Section 4.3 on the solve times. For this, we performed the solve on a subset of the matrices given in Table 1 using one right-hand side and for a number of workers ranging between 1 and 28. Note that for this experiment, we selected the matrices which perform a high number of floating-point operations during the factorization. Our results, given in Table 2, include the solve times, with and without pruning, as well as the number of subtrees selected by the pruning algorithm when using 1, 2, 4, 14 and 28 workers.  It should be noted that for all tested matrices, the best solve times are obtained when the tree pruning algorithm is enabled. Moreover, the impact of pruning is particularly high when nworker is large. For instance, enabling the tree pruning for boneS10 when using 2 workers is insignificant whereas the solve time is divided by a factor of 6 for this matrix when enabling tree pruning on 28 workers. Another notable effect of using tree pruning is that, although the solve times between 14 workers and 28 workers tend to increase without tree pruning, they keep decreasing if tree pruning is activated. This shows the benefits of using the pruning strategy to improve the scalability of our algorithm on a large number of resources.
As mentioned in Section 4.3, the tree pruning strategy improves the performance mainly for the following two reasons: firstly, it reduces the number of tasks in the DAG and more particularly small granularity tasks, and secondly, it increases kernel efficiency by improving the exploitation of data locality. This can be seen when looking at the data in the first column of Table 2. When using only one worker and disabling the pruning, all the tasks in the DAG are generated, submitted to the runtime system and processed by a single thread. On the other hand, when the pruning is enabled, the number of subtrees is equal to one which means that only one task, responsible for processing the whole tree, is generated. In this case, the cost for the task management is negligible and, as we can see in Table 2, the solve times are always reduced.

Impact of the number of synchronization tasks on the performance
As discussed in Section 4.2, tasks have a variable number of dependencies that we handle through a k-ary tree. In this section, we give details of the effect of the synchronization tasks on the solve time. We consider the case of 1 righthand side and 64 right-hand sides. Table 3 presents the results on the same subset of matrices as in Table 2. We first focus on the case of one right-hand side where the pruning is disabled. We observe that the time to solve is the highest when c is equal to two. Moreover, the number of synchronization tasks is divided by at least a factor of 3 when increasing the chunk size from 2 to 3. After some point, the number of synchronization tasks is low enough that reducing it further does not noticeably improve the execution time. When pruning is used, we observe that the number of synchronization tasks is divided by at least a factor of 2 for a chunk size of 2. This means that the sensitivity of the times on the chunk size is much reduced when pruning is used. Likewise, when we solve for 64 right-hand sides, the extra work required means that the number of synchronization tasks has little impact on the time, with or without pruning. The arithmetic intensity in the kernels is much larger than the overhead of scheduling and executing synchronization tasks. However, it is possible that for larger workloads than our test matrices, a small chunk size would penalize the performance because of a high number of synchronization tasks. For this reason, we use a chunk size of 10 in our solver that we consider to be large enough to avoid noticeable impact on the results on the tested matrices and potentially on bigger problems as well.

Impact of the block size on the performance
In this section, we study the impact of the block size on the time to solve the system and show our results in Table 4. We consider matrices that require a large number of flops during the factorization so that the block size should be greater than 256. We set nworker to 28 and nrhs to 128. For all these matrices, the optimal block size for the solve phase is lower than the optimal block size for the factorization. This leads us to conclude that we should use a different block size, depending on nrhs and the application. For example, in a code like preconditioned CG, the overall performance may depend on the time spent in applying the preconditioner. In that case, an efficient factorization is not as important as an optimal solve. We use a default block size of 256 in the following experiments to obtain a fair comparison with other solvers.

Strong scaling on nrhs
In this section, we focus on the effect of nrhs on the time to solve the system. We fix nworker to 28 and the block size to 256, and we vary nrhs from 1 to 128 in powers of 2.
We first compare, in Tables 5 and 6, SpLLT with PAR-DISO, PaStiX and HSL_MA87 when solving for one or two right-hand sides. These results show that PARDISO is the slowest for all the matrices in Table 1. SpLLT is the fastest to solve the system, up to 8 times faster than PAR-DISO for one right-hand side and up to over a factor of 10 on two right-hand sides. There are only five problems in Table 1 for which SpLLT is not the fastest of all four codes for both one right-hand side and two right-hand sides. In two cases, the difference from the fastest code is marginal. The only matrices for which SpLLT is significantly slower are the two matrices nd12k and nd24k and the matrix StocF-1465. In the case of the first two, the assembly tree has long chains that are not combined by pruning so the runtime overhead for many small tasks causes the poor performance. If we force more node amalgamation in the analyse phase, the number of long chains decreases significantly, pruning is more effective, and we are much more competitive. The StocF-1465 matrix is very reducible with many 1 Â 1 blocks. Again our pruning has little effect and we are penalized by the number of small tasks. We recommend that reducibility is respected in the solution process, and elimination techniques are only used on irreducible blocks. We see that on one of the matrices, Emilia_923, PaStiX is faster than SpLLT for one right-hand side (43.70 Â 10 À2 s vs. 44.70 Â 10 À2 s) and SpLLT is faster than PaStiX for two right-hand sides (47.70 Â 10 À2 s vs. 58.6 Â 10 À2 s). The solve time for SpLLT on two systems is always lower than solving one system twice.
We investigate this for more right-hand sides. In Tables 7 and 8, we give the time for SpLLT, PARDISO, PaStiX and HSL_MA87, for all matrices of Table 1 when solving for 16 and 128 right-hand sides. When nrhs increases, PaStiX and HSL_MA87 become less competitive than SpLLT. We show the ratio of the time for solving several right-hand sides to the time to solve one right-hand side in Table 9. These  results show that the power of level 3 BLAS means that there is often very little overhead for solving two right-hand sides over one and the extra cost when solving for 16 or 128 righthand sides is remarkably low. Since HSL_MA87 does not scale with nrhs, in the following, we compare the time to solve for SpLLT with PARDISO and PaStiX, and we display the ratio of times in Figure 7. Figure 8 presents two representative problems that illustrate the behaviour of each solver. For all nrhs, SpLLT is the fastest solver sometimes by a considerable amount. PaStiX scales much worse than SpLLT as nrhs increases. The scalability of PARDISO can vary considerably (see Serena). Since the source code for PARDISO is not distributed, we do not know why this is the case. We have similarly erratic scalability for several other matrices in our test set. In Figure 8(a), SpLLT is eight times faster   than PARDISO for two right-hand sides, and these two curves increase almost similarly when nrhs increases. On the other hand, the time to solve using PaStiX is close to SpLLT for 2 right-hand sides but is 5 times slower on 128 right-hand sides. In general, SpLLT scales better than the other codes.

Strong scaling on nworker
In this section, we study the impact of nworker on the time to solve a system with 1 right-hand side and 64 right-hand sides. nworker increases from 1 to 28, and the block size is 256. We first consider one right-hand side and present some results in Figure 9. We use the same two matrices as in Figure 8. The experiments show that for a small number of workers, the time to solve using SpLLT is greater than both other solvers. When nworker increases, the solid curve that represents SpLLT becomes closer to the other curves. For 28 workers, SpLLT is the fastest solver. Figure 10 shows the impact of increasing nworker when there are 64 right-hand sides. SpLLT is more competitive than for a small nrhs. In fact, SpLLT outperforms PAR-DISO for all numbers of workers and is faster than PaStiX when nworker increases. For more than nine workers, these two representative problems show that SpLLT is clearly the fastest solver.

Application case: Enlarged CG solver
In this section, we present some results obtained with the enlarged CG solver (ECG) (Grigori and Tissot, 2017). Our runs in this section are on a different machine to our earlier results. We use the Kebnekaise system at Umeå in Sweden (https://www.hpc2n.umu.se/resources/hardware/kebne kaise). Each compute node contains 28 Intel Xeon E5-2690v4 cores organized into 2 NUMA islands with 14 cores in each. The nodes are connected with an FDR Infiniband Network. The total amount of RAM per compute node is 128 GB. ECG and SpLLT are compiled with Intel 18.0.1 and linked to Metis 5.1.0. The ECG solver is a preconditioned CG solver that augments the number of working vectors to reduce the number of iterations as well as to obtain better parallelism and to reduce the amount of communication. ECG uses a block Jacobi preconditioner. Although each diagonal block is factorized only once, the solve of each associated local system is performed at each iteration of ECG. Table 10 presents the time to solve the global system with ECG using PARDISO or SpLLT on a subset of matrices from Table 1. We show results from 16 MPI processes to 256 MPI processes. We see little difference in solution times for the shipsec1 problem but, on the two other problems that are 10 times larger, we observe that replacing the PARDISO Solver with the SpLLT solver leads to a speed-up over a factor of 2. In all cases, solving the system with 16 MPI processes using SpLLT is faster than using PARDISO with 32 MPI processes.   ECG: enlarged conjugate gradient solver. a ECG is set with a tolerance of 10 À5 , an enlarge factor of 12, and SpLLT has a block size of 256 with 14 workers.

Conclusions
We have designed, implemented and tested new routines for the forward and backward substitution steps in the parallel solution of sparse positive definite systems using a Cholesky factorization. We have given details of how our design exploits parallelism while keeping data movement low. We have developed a code that we have tested on a multicore node and have targeted and achieved good scalability both with respect to the number of cores and to nrhs. We have compared our code to other state-of-the-art codes and shown that it is usually far superior sometimes outperforming the other codes by a factor of over 10. We have shown that our code is strongly scalable both for cores and right-hand sides. Finally, we have tested our code within a real application and shown big gains over the previous version that used another solver. We are continuing our work with the authors of ECG at Inria and have seen even bigger gains on markedly larger problems than those used in this article. For example, on a matrix from a diffusion problem of order nearly 5 million with over 34 million entries, we reduce the solution time by a factor of nearly 3 on 16 MPI processes and by a factor of nearly 2 on 256 processes.