5.2.1 Serial Initialization

The main tasks during the serial initialization phase are summarized in Fig. 5.4. When the application is started, the different instances of the application on various processors are initialized and synchronized. Then the first processor reads the configuration file for the simulation and evaluates the parameters and command line arguments. This is followed by reading the finite element mesh, reading the material parameters file and initializing the magnetization distribution. If requested, the finite element mesh is refined globally before it is partitioned using the Metis library.

**Figure 5.4:** Flow chart of the single processor initialization section.
$\includegraphics[scale=0.5]{fig/talk/fig/fluss2.fig.eps}$

Metis uses a very efficient serial multilevel

-way partitioning algorithm [67,68], which gives high quality partitionings. First, the structure of the finite element mesh has to be mapped to a graph suitable for processing by Metis. The structure of the mesh is in its simplest form defined by a list of elements, for each of which the ids of its four vertices are given. The utility function METIS_MeshToDual generates from this information the dual graph in which each vertex corresponds to an element in the mesh and each edge connecting two vertices indicates, that the corresponding elements share a face. (This information is later also used to detect the boundary triangles of the finite element mesh and to calculate the boundary matrix as described in Sec. 3.4.) Then the graph is partitioned in such a way, that the mesh is split into equal-size parts and the number of adjacent elements assigned to different processors is minimized. The first condition leads to a load balancing among the processors (assuming a set of equal processors), whereas the second should minimize the communication between different processors resulting from the assignment of adjacent elements to different processors. Based on the partitioning of the elements the nodes of the mesh are assigned to those processors, to which the majority of elements, which shares a given node, belongs (subject to balance constraints). Each processor then has (approximately) the same number of elements and nodes and a minimum of faces and nodes, which are shared with other processors. Fig. 5.5 shows the partitioning of the finite element mesh of a flat cylinder. The model is cut parallel to the axis of the cylinder, which results in the smallest areas of the cut planes and the smallest number of faces and nodes, which are shared by different partitions.

**Figure 5.5:** Mesh partitioning of a soft magnetic nanodot for two, three, four, and ten processors, respectively. Different colors correspond to parts which are assigned to different processors.
(a) $\includegraphics[scale=0.3]{fig/dots0200651/p2/p22.eps}$ (b) $\includegraphics[scale=0.3]{fig/dots0200651/p3/p34.eps}$ (c) $\includegraphics[scale=0.3]{fig/dots0200651/p4/p44.eps}$ (d) $\includegraphics[scale=0.3]{fig/dots0200651/p10/p104.eps}$

The dual graph of the finite element mesh is also used to prepare the calculation of the boundary element matrix involved in the calculation of the demagnetizing field (cf. Sec. 3.4). There we need the boundary faces of the finite element mesh. Since we know the element connectivities from the dual graph, we immediately know all boundary elements, since they have at most three neighbors with which they share a common face. As we loop over all these elements, we just have to compare the faces with those of the neighbors to find out, which face lies on the surface of the finite element mesh. Especially this step would be quite troublesome, if the mesh was already distributed, because it would require a lot of communication and bookkeeping. Then, the boundary nodes are easily identified by the boundary faces.

According to the partitioning scheme suggested by Metis all data are then distributed to their destination processors. Many parameters like the initial time, number of nodes, elements, and faces, material parameters and vertex coordinates are broadcast to all processors. Other data arrays like the vertex to element assignment, element properties, and other element data are delivered just to the processors, which ``own'' the element. However, these data structures make only a minor part of the total required storage space.

All these steps are done only by the first processor, but the serial initialization part is very short and not computation intensive, which is the reason for the excellent scaling of the program initialization, which will be shown in Sec. 6.3.