The week began with the cluster almost ready, but issues emerged during testing with MPI and SLURM. Specifically:
SLURM (srun): ran without errors only in singleton mode and failed to scale to multiple nodes.
MPI (mpirun): crashed outright, returning the error: "PSEC: munge failed to create credential: Socket communication error."
The culprit was Munge, the tool SLURM uses for authentication across nodes. Although all nodes had it installed and shared the same munge.key, mpirun continued to fail.
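A quick way to rule out the obvious Munge problems is to check that a credential created on one node is accepted on another (a minimal sketch; the remote hostname is just an example, and it assumes passwordless SSH between nodes):

    # Encode and decode a credential locally: verifies munged is running
    munge -n | unmunge

    # Encode locally, decode on another node: verifies the shared munge.key
    # and that the clocks are in sync (replace 'andrea' with any other node)
    munge -n | ssh andrea unmunge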
After trying several fixes, we found a relevant article by Hossein Ghorbanfekr on Medium, which detailed a similar setup using Raspberry Pi 4 nodes with a Pi 3 as master: he compiled SLURM and MPI from scratch and his setup was working; however, the same authentication issue persisted on ours.
Later, deeper research by Ernesto revealed an Intel cluster user's solution:
By setting PMIX_MCA_psec=native, MPI bypassed Munge and used its default (native) authentication. This change allowed MPI to communicate properly across nodes without the Munge authentication layer.
Testing this:
PMIX_MCA_psec=munge with srun -n 16 hostname worked.
mpirun still faced issues, which we plan to address when we have more time, as the current priority is communication stability.
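For reference, the variable can either be exported once or set per command (a minimal sketch of what we ran; for srun to honor it, SLURM has to be using PMIx as its PMI implementation):

    # Let PMIx fall back to its native security plugin instead of Munge
    export PMIX_MCA_psec=native

    # Sanity check: 16 tasks across the cluster, each prints its hostname
    srun -n 16 hostname

    # The same thing as a one-off, forcing the munge plugin instead
    PMIX_MCA_psec=munge srun -n 16 hostname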
We decided to include the previous master, Ciz0 (the only node with 8 GB of RAM), among the compute nodes and passed the role of master to Andrea, a Raspberry Pi 3B with 1 GB of RAM, enhancing the cluster specs.
Ciz0 still handles the filesystem... this might become a bottleneck when multiple users hit the cluster at the same time, since it now has more tasks to do.
The problem will be solved once the filesystem is handled by the master, which will happen when we get more Raspberries.
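We haven't made that move yet, but it should amount to re-exporting the shared directory from the master instead of Ciz0 (a sketch only, assuming an NFS share; /clusterfs, the subnet, and the paths are placeholders, not our actual configuration):

    # On the master (the new file server): entry in /etc/exports
    /clusterfs 192.168.1.0/24(rw,sync,no_subtree_check)
    # then reload the export table
    sudo exportfs -ra

    # On every compute node: /etc/fstab entry pointing at the master
    andrea:/clusterfs /clusterfs nfs defaults 0 0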
Saverio is now satisfied that his previously underutilized hardware has been put to work!
Going back to Munge for authentication is scheduled for later, as we are prioritizing cluster stability for the presentation of 13/11/24 (which has been rescheduled to 20/11/24).
Professor De Sensi gave us 15 minutes of his multicore class to present our cluster to the students. This is a big opportunity to expand the Plum juice team and to let the world know about us!
The multicore course is meant to give a deep understanding of, and practical experience with, parallel programming. This is done by assigning students homework where they implement classic parallelizable problems like matrix multiplication, histogram computation, and integral approximation. The libraries taught are MPI, OpenMP, Pthreads, and CUDA.
The Plum juice team offers students the chance to easily test their applications on a truly multi-node system, effectively simulating more expensive clusters like Leonardo.
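In practice, trying a homework on the cluster would look roughly like this (a sketch, assuming an MPI assignment compiled on the shared filesystem and SLURM built with PMIx support; file names are examples):

    # Compile on the shared directory so every node sees the same binary
    mpicc -O2 -o matmul matmul.c

    # Run 16 MPI ranks spread across the Raspberry Pi nodes
    srun -n 16 --mpi=pmix ./matmul

    # Or just check which nodes the job landed on
    srun -n 16 hostname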
Ciz: Oh no, just kidding... it can't find the node name. Perfect, we're off to a great start.
Ciz: Guys, the configuration file is full of errors... honestly, it's embarrassing.
Saverio: Umm... maybe because you're running it from the master? Just a thought.
Ciz: Why doesn't poor ciz0 have Slurm installed? This is discrimination!
Ciz: Fine, I just need to add it to the playbook, right?
Ciz: Wait, so it's not installed at all. Great.
Ernesto: Actually, if there's no Munge, Slurm won't even acknowledge you. It's like an honor code.
Ciz: Ok, summary: changed the master password, removed TaskAffinity from the config because it was throwing errors anyway. Don't ask me why it's in the docs, but it doesn't work.
Ciz: Then, added the node name, and now Slurmd is running on all nodes! 🎉
Ciz: Oh, but Slurmctld won't start because it can't create /var/spool/slurmctld. Of course, it doesn't have permissions... nothing works, help me.
... The discussion reached the point where we were considering a full reset ...
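For the record, the fixes that came out of that discussion boil down to a few lines (a sketch of the general idea rather than our exact slurm.conf; the CPU count and memory figure are examples):

    # slurm.conf: every node has to be declared by its real hostname,
    # which is what slurmd was complaining about
    NodeName=ciz0 CPUs=4 RealMemory=7800 State=UNKNOWN
    PartitionName=main Nodes=ALL Default=YES State=UP

    # slurmctld needs its spool directory to exist and to belong to the slurm user
    sudo mkdir -p /var/spool/slurmctld
    sudo chown slurm:slurm /var/spool/slurmctld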
This week Lorenzo attended SC24, the world's biggest HPC conference! (what a lucky plum he is...)
It has been a tremendous experience and... OH WAIT, he can't disclose it just yet, he will write about it himself!
So, if you want to know what SC24 was all about, stay tuned, guys!
Lorenzo at the International Conference for High Performance Computing, Networking, Storage, and Analysis