ALADIN on a cluster of Linux/Digital Unix workstations

(more details: jure.jerman++at++rzs-hm.si)

Introduction

The ALADIN/SI model has been operational at HMIS for two years. ALADIN/SI is still running on an old Digital Unix Alpha 600 5/333 workstation and on a no-name machine with a 500 MHz Alpha processor. Computers have become faster and much cheaper over the last two years, but single-processor workstations are still not fast enough to allow a major increase in ALADIN resolution. The situation has changed with the appearance of the distributed-memory version of the ALADIN code: a powerful cluster of workstations can now be built from cheap "off the shelf" components.

A test cluster of 20 workstations, based on the Alpha processor and the Linux operating system, was built, and ALADIN was successfully ported to this environment. The result is a machine able to run ALADIN with an excellent price/performance ratio. The model was tested in distributed-memory (DM) mode using cycle AL09_CY19T1 with the newest bugfix. The only configuration tested was configuration 001 with the two-time-level semi-Lagrangian scheme, LFPOS=.TRUE. and no DFI. The main goal of the tests was to obtain an estimate of the performance of the code in DM mode in a workstation environment.

Tests were performed on a homogeneous "Beowulf"-type cluster of workstations based on Alpha CPUs and the Red Hat Linux OS. MPI was used as the inter-processor communication protocol.

Porting ALADIN to a cluster

Besides the hardware (a few workstations connected together with FastEthernet), proper communication software is needed. The ALADIN code supports both MPI and PVM, but we focused our attention on MPI.

There are two widely used public domain MPI implementations, LAM and MPICH, which provide an MPI environment for the majority of Unix systems. Tests were done with both, and they performed much the same. LAM is perhaps a bit more user friendly: it comes with a basic MPI user's guide, it has a fancy XMPI job launcher with a lot of colourful effects, and it offers more tools for verifying the communications. MPICH, on the other hand, offers more freedom in how the job is started.
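A minimal MPI test program is a convenient way to check that either implementation is installed and configured correctly before linking ALADIN against it (a generic sketch, not part of the ALADIN sources):

      PROGRAM MPITEST
      IMPLICIT NONE
      INCLUDE 'mpif.h'
      INTEGER IERR, IRANK, ISIZE
C     initialise MPI and report which task is speaking
      CALL MPI_INIT(IERR)
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, IRANK, IERR)
      CALL MPI_COMM_SIZE(MPI_COMM_WORLD, ISIZE, IERR)
      PRINT *, 'task ', IRANK, ' of ', ISIZE, ' is alive'
      CALL MPI_FINALIZE(IERR)
      END

Such a program can be started with the mpirun command of either distribution.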

For further information refer to http://www.mcs.anl.gov/mpi/mpich/ (MPICH) and http://www.mpi.nd.edu/lam/ (LAM).

Changes in the Aladin code

For detailed instructions on the implementation of AL09 on DEC workstations, please refer to the mail of Gabor Radnoti sent to the alabobo mailing list. Only two additional routines had to be changed in order to run Aladin in DM mode:

It is very important to replace the file xrd19/mpe/include/mpif.h with the one coming with the MPI (LAM/MPICH) distribution before compiling xrd19/mpe. The mpif.h file is tailored to the specific MPI implementation and contains definitions used in the mpe routines.
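As an illustration, mpif.h defines, among other things, the size of the status array and the symbolic indices into it that are used below; the numerical values shown here are placeholders only, since they differ between implementations:

C     illustrative excerpt of mpif.h (values are placeholders only)
      INTEGER MPI_STATUS_SIZE
      PARAMETER (MPI_STATUS_SIZE = 4)
      INTEGER MPI_SOURCE, MPI_TAG, MPI_ERROR
      PARAMETER (MPI_SOURCE = 1, MPI_TAG = 2, MPI_ERROR = 3)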

The size of the received message, the message type and the message tag are returned through an argument of mpe_recv.F, which is an array. This status array (irecv_status) has to be declared explicitly as INTEGER*4.

Instead of:

      call MPI_get_count(irecv_status,MPI_BYTE,ilen,kerror)
      krcount = ilen/ibytes
      krfrom  = irecv_status(MPI_SOURCE) + 1 ! MPI_SOURCE = 1
      krtag   = irecv_status(MPI_TAG)        ! MPI_TAG    = 2

the code was changed to:

      krcount = irecv_status(1)/ibytes
      krfrom  = irecv_status(MPI_SOURCE) + 1
      krtag   = irecv_status(MPI_TAG)
      kerror  = irecv_status(MPI_ERROR)

The MPI standard does not guarantee that command line options are distributed to all processes (there are no problems with LAM, but with MPICH the command line options have to be sent from the master process to all the others explicitly). In order to achieve portability between different MPI implementations, suarg was modified: the mpe_broadcast function is now used to broadcast the command line options to all non-master processes.
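The sketch below shows the idea with plain MPI calls; it is not the actual suarg/mpe_broadcast code, and the routine and helper names are only illustrative:

      SUBROUTINE BCARGS(CDARGS, KLEN)
      IMPLICIT NONE
      INCLUDE 'mpif.h'
      CHARACTER*256 CDARGS
      INTEGER*4 KLEN, IRANK, IERR
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, IRANK, IERR)
      IF (IRANK.EQ.0) THEN
C        hypothetical helper: collect the command line on the master
         CALL GETARGS(CDARGS, KLEN)
      ENDIF
C     broadcast the length and the option string from task 0, so that
C     every task sees the same options no matter how it was started
      CALL MPI_BCAST(KLEN, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, IERR)
      CALL MPI_BCAST(CDARGS, 256, MPI_CHARACTER, 0,
     &               MPI_COMM_WORLD, IERR)
      RETURN
      END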

Hardware

The test cluster consisted of 21 Alpha workstations. It was built purely for test purposes and will be disassembled after some time. All nodes have the same architecture, except the master node, which has an additional network card and additional memory.

Typical node configuration:

  CPU         533 MHz 21164PC Alpha
  Memory      128 MB (master node 192 MB)
  Network     1 x Full Duplex 100BaseT (master node 2 x)
  Hard disk   4 GB

The boxes are connected through a 100BaseT FastEthernet network and a FORE switch. It turned out to be crucial to have a switch instead of a hub, because the network throughput is much higher with a switch. The estimated cost of a cluster like this is $2,500 per node.

Performance could be improved with a better network (Myrinet or Gigabit Ethernet), but this would add an additional $1,000 per node for 2-4 times better network performance. In addition, machines with the new Alpha (21264) processor are already on the market. They are twice as fast as machines with the 21164 processor, but the price/performance ratio is roughly the same for both.

Software

Besides Red Hat Linux 5.2 as the operating system, MPICH and LAM were used as MPI implementations. In addition, cluster tools (bWatch, Smile) are installed for easier maintenance of the cluster. DQS is used as the queuing system to increase the efficiency of the cluster.

Results

The theoretical power of the "Ali Baba" cluster is 20 GFLOPS; in practice approximately 2-3 GFLOPS are obtained, which is comparable to 8 processors of a CRAY J-916.
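For reference, the theoretical figure corresponds to the usual peak of two floating-point operations per clock cycle on this class of Alpha processor, assuming 20 compute nodes: 20 x 533 MHz x 2 flops/cycle is roughly 21 GFLOPS.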

The speed-up for an increasing number of processors in the ALADIN integration can be estimated from Figure 2. The speed-up is excellent for small numbers of processors, but a saturation effect appears for larger numbers. The saturation is less pronounced for bigger domains. The reasons for it are the computational overhead caused by domain splitting and the increased communication between processors as their number grows.
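A back-of-the-envelope picture of this saturation (an assumed model, not a fit to the measurements): if the time per step behaves roughly as T(N) = T_comp/N + T_comm(N), where T_comp is the computational work and T_comm grows with the number of processors N, then the speed-up S(N) = T(1)/T(N) stays close to N while T_comp/N dominates, and levels off once the communication and splitting overhead becomes comparable to it; larger domains have a larger T_comp and therefore saturate later.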

Domain      NDGL   NDLON   NLEV   Min time/TSTEP   NPROC
SI            80      80     31        2.3            12
HUN          108     120     31        6.0            12
Envelope     108     144     31        6.5            12
LACE_OLD     192     216     27       12.0            16
Benchmark    256     240     27       17.5            18

Table 1. Performance of the cluster for different integration domains. Min time/TSTEP is the minimum integration time for one time step, which differs from domain to domain.

Future work

We are now at the stage where we can declare the cluster solution working for the computation of ALADIN, but it still needs some tuning and further effort. We still have to work on the stability and fault tolerance of the cluster. This work consists of installing additional software tools, such as a cluster monitoring system. To achieve better usage of the cluster resources, a queuing system will be installed: either DQS (similar to the generic NQS) or its commercial version CODINE.

Aladin and Linux

As already mentioned, ALADIN is running on Alpha boxes with Linux. Machines with Linux are much cheaper because Linux is free; the only problem is that there is no good Fortran 90 compiler for Alpha/Linux. Some compilers are available, but performance would be substantially degraded.

There are rumours that Compaq will release its f90 compiler (formerly DEC f90) also for Linux. In that case Linux would become a very reasonable choice. At our institute the Linux boxes do not cause more problems than other brand-name Unix machines.

With the appearance of the new Linux kernel (2.2) the situation has improved a lot. ALADIN runs up to 15 % faster, so the scores listed above are already obsolete. In some cases the same binary compiled on Digital Unix runs even faster under Linux 2.2 than under Digital Unix itself.

Conclusions

A Beowulf-type cluster of workstations is a cheap, powerful and scalable solution. The Linux OS in combination with code compiled on Digital Unix performs extremely well. Such a cluster is very convenient for larger problems, while there is a problem of saturation for smaller ones - the scalability is also determined by the nature of the problem.

The almost instant success of porting Aladin to  such an "exotic" environment is a great proof of the efficiency of the Aladin code and of the whole project.

Figures

Fig 1. Photo of "Ali Baba" cluster

Fig 2. Computation times per time step for different domains and different numbers of processors, for normal integration.



