This section provides some ramblings on commonly encountered problems.
Note that it has not yet been updated for the alpha test release of the
MPIBLACS.
General problems:
- Undefined BLACS symbols during link.
PVM-specific problems:
- Spawned processes do not check in.
- Strange behavior while using blacs_setup.dat.
- Got message like "pvm error #XXX".
- Code hangs when run on RS6000.
- Code hangs when run on multiple machines.
Undefined BLACS symbols during link
The BLACS routines are roughly divided into two categories: the top-level routines
(i.e., the ones callable by the user), and the routines that these top-level
routines call. The top-level routines provide the interface
for the library. Non-interface routines are referred to as internal
routines. Interface routines are those documented in the manual or quick
reference guide. Internal routines will have names that vary from system
to system. Many end in the suffix 00. Examples include
Smpath_bs, Asend00, and itrpack00. If you have a library
produced by UT, you can discover the names of the interface routines
by doing an ls in your ...BLACS/SRC/<ARCH>/
directory. Internal routines are in
...BLACS/SRC/<ARCH>/INTERNAL/
If all of the missing symbols are interface routines, then you probably
have an interface problem. If they are all
internal routines, then you probably have an
internal problem. If the missing symbols include
both internal and interface routines, you are probably pointing at an
invalid library.
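The rule of thumb above can be written down as a small decision function. This is only a sketch: the function name, the two example name sets, and the assumption that the linker's symbol names have already been stripped of any system-specific decoration (such as a trailing underscore) are illustrative and not part of the BLACS distribution. In practice the two sets would be filled from the directory listings described above.

```python
# Hypothetical example sets; real ones would come from listing the
# interface and INTERNAL source directories of your BLACS installation.
INTERFACE_NAMES = {"blacs_gridinit", "dgesd2d"}
INTERNAL_NAMES = {"Smpath_bs", "Asend00", "itrpack00"}


def diagnose_missing_symbols(missing, interface_names, internal_names):
    """Classify undefined BLACS symbols as described in the text.

    Assumes the names in `missing` are undecorated routine names.
    """
    missing = set(missing)
    has_interface = bool(missing & set(interface_names))
    has_internal = bool(missing & set(internal_names))

    if has_interface and has_internal:
        return "invalid library"
    if has_interface:
        return "interface problem"
    if has_internal:
        return "internal problem"
    # None of the missing symbols is a known BLACS routine.
    return "unknown"
```

For instance, calling diagnose_missing_symbols(["Asend00"], INTERFACE_NAMES, INTERNAL_NAMES) classifies the failure as an internal problem, while a mix of names from both sets points to an invalid library.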
Invalid library problems are usually fairly straightforward to diagnose.
You may be pointing at a BLACS library that does not exist. If you are
maintaining BLACS for several different systems, you may be pointing at
the wrong system's libraries. For instance, you may want the HP version
of the PVM BLACS, but you are pointing at the SUN4 version.
There are a couple of things that commonly cause interface
problems. The most likely is that a particular interface was not installed.
The BLACS may be installed for Fortran and/or C: the installer may choose
to install only one.
If the missing symbols all begin with C, then you need to build
the C interface library. This can be done from the top-level makefile by
typing make intface=Clib. If it is instead
the Fortran interface that has not been installed, it can be built
from the top-level makefile by typing make intface=F77lib.
Another possibility is that the BLACS library was compiled with an
incorrect Bmake.inc file. The BLACS are written in C, but made
to be callable from Fortran77. The method for calling a C routine from Fortran
is system specific. Therefore, Bmake.inc allows the user to
vary the BLACS naming scheme, so that the routines may be called from Fortran.
Bmake.inc contains a macro called INTFACE, which
performs this function. Some systems (e.g. SUN4, CM-5, and Intel) require
an underscore to be postfixed to a C routine name for it to be callable from
Fortran.
This type of interface should be indicated by defining
INTFACE = -DAdd_.
Others (e.g. HP, RS6000) let Fortran and C share the same name space,
so that no change is required to call a C routine from Fortran.
This type of interface should be indicated by defining
INTFACE = -DNoChange.
Finally, some systems (primarily CRAY) require that the C routine name be
in upper case for it to be callable from Fortran.
This type of interface should be indicated by defining
INTFACE = -DUpCase.
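To make the three naming schemes concrete, the sketch below maps a routine name to the symbol a Fortran caller would expect under each INTFACE setting. It is an illustration only: the function name is hypothetical, and it assumes the base routine name is given in lower case, as in blacs_gridinit. It can help when comparing the undefined symbols reported by the linker against the setting in Bmake.inc.

```python
def fortran_symbol(routine_name, intface):
    """Return the symbol name expected under a given INTFACE setting.

    `intface` is the part of the macro value after "-D":
    "Add_", "NoChange", or "UpCase".
    """
    if intface == "Add_":
        # e.g. SUN4, CM-5, Intel: an underscore is appended.
        return routine_name + "_"
    if intface == "NoChange":
        # e.g. HP, RS6000: Fortran and C share the same name space.
        return routine_name
    if intface == "UpCase":
        # primarily CRAY: the name must be in upper case.
        return routine_name.upper()
    raise ValueError("unknown INTFACE setting: " + intface)
```

If the linker reports a missing blacs_gridinit_ but the library was built with INTFACE = -DNoChange, the library contains blacs_gridinit instead, which suggests an incorrect Bmake.inc.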
Having only the internal routines missing happens most often to PVM users
when a previous platform's internals have been compiled into the current library
(e.g., the internals for HP are compiled into a SUN4 library).
If this is the case, remove the library, do a make clean,
and rebuild.
Spawned processes do not check in
This usually indicates that there is insufficient memory to spawn the
required number of processes. The pvm_spawn succeeds, but when the system
attempts to allocate the process's memory, it fails, and thus the process
never truly gets started. The usual fix for this is to add more hosts,
or to reduce the size of your executable.
Strange behavior while using blacs_setup.dat
Old blacs_setup.dat files often remain around after their use, and when you
run your next program, the old file is accessed, causing the wrong executable
to be spawned. The BLACS do not
require you to use blacs_setup.dat, and it is recommended that you do not.
If you require the extra power blacs_setup.dat gives (e.g., you need to spawn
with debug), then of course it should be used.
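One way to guard against a forgotten file is to check for it before launching a run. The following is a minimal sketch, assuming the file is looked up in the directory from which the program is started; that location, and the function name, are assumptions made for illustration and should be checked against your installation.

```python
import os


def find_setup_file(run_dir="."):
    """Return the path of blacs_setup.dat in run_dir, or None if absent."""
    path = os.path.join(run_dir, "blacs_setup.dat")
    return path if os.path.isfile(path) else None


if __name__ == "__main__":
    leftover = find_setup_file()
    if leftover is not None:
        print("warning: found " + leftover + "; remove it unless it is intended")
```

Running this from the launch directory before each job gives an early warning that an old file may cause the wrong executable to be spawned.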
Got message like "pvm error #XXX"
The BLACS may encounter a PVM error that they are not designed to handle.
In that case, they simply abort after reporting the error. The number printed
is a PVM error number. The meaning of these error numbers can be found in
the PVM manual or quick reference guide.
Code hangs when run on RS6000
This should only happen when you use an old PVM version, and
perform some very strenuous communication, such as running the BLACS tester.
The recommended fix is downloading the newest version of PVM.
Code hangs when run on multiple machines
Usually this is caused by PVM dying because there were too many messages,
and the network was too busy to service them. Eventually, one of the
pvmd3's will go down, and the network will get confused. Examine your
pvml.<user id> file on each system for clues. Usually, you'll get something
like "lost track of master, you're screewwwweeed".