AZAZAZ
Posts: 11
Joined: Fri Jul 05, 2019 6:38 pm

SLURM enabled but not starting

Thu Jul 11, 2019 6:08 pm

Hi guys,
I'm building a Raspberry Pi cluster for a project and am currently trying to get SLURM working. I can enable it, but when I run "sudo systemctl start slurmd", I get the error message shown in the screenshot. Could this be because I haven't edited the slurm.conf file correctly?
Attachments
Screen Shot 2019-07-11 at 10.43.28 AM.png

DirkS
Posts: 9874
Joined: Tue Jun 19, 2012 9:46 pm
Location: Essex, UK

Re: SLURM enabled but not starting

Thu Jul 11, 2019 6:10 pm

So did you follow the advice in the messages (i.e. check the status to get more details)?
And please post text instead of screenshots.
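
For example (assuming the unit is named slurmd, as in your start command), these should print the failure details as plain text you can paste here:

Code: Select all

sudo systemctl status slurmd.service
sudo journalctl -u slurmd --no-pager -n 50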

AZAZAZ
Posts: 11
Joined: Fri Jul 05, 2019 6:38 pm

Re: SLURM enabled but not starting

Thu Jul 11, 2019 6:23 pm

Here is what the error message said:

Code: Select all

 slurmd.service - Slurm node daemon
   Loaded: loaded (/lib/systemd/system/slurmd.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Thu 2019-07-11 11:17:53 MST; 3min 59s ago
     Docs: man:slurmd(8)
  Process: 2097 ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=1/FAILURE)

Jul 11 11:17:53 Masternode systemd[1]: Starting Slurm node daemon...
Jul 11 11:17:53 Masternode systemd[1]: slurmd.service: Control process exited, code=exited, status=1/FAILURE
Jul 11 11:17:53 Masternode systemd[1]: slurmd.service: Failed with result 'exit-code'.
Jul 11 11:17:53 Masternode systemd[1]: Failed to start Slurm node daemon.

DirkS
Posts: 9874
Joined: Tue Jun 19, 2012 9:46 pm
Location: Essex, UK

Re: SLURM enabled but not starting

Thu Jul 11, 2019 6:58 pm

Well, that's not very informative...
You could try to start the program manually (on the command line) and see what the output is.
Or maybe it creates a log file?
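
For example, slurmd can be run in the foreground with extra verbosity, which usually prints the actual reason it exits:

Code: Select all

sudo slurmd -D -vvv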

AZAZAZ
Posts: 11
Joined: Fri Jul 05, 2019 6:38 pm

Re: SLURM enabled but not starting

Mon Jul 15, 2019 8:22 pm

Here is what it reads when I run "journalctl -xe". Do you see anything wrong?

Code: Select all

sudo journalctl -xe
-- A start job for unit slurmd.service has begun execution.
-- 
-- The job identifier is 1415.
Jul 15 11:16:19 Node02 systemd[1]: slurmd.service: Control process exited, code=exited, status=1/FAILURE
-- Subject: Unit process exited
-- Defined-By: systemd
-- Support: https://www.debian.org/support
-- 
-- An ExecStart= process belonging to unit slurmd.service has exited.
-- 
-- The process' exit code is 'exited' and its exit status is 1.
Jul 15 11:16:19 Node02 systemd[1]: slurmd.service: Failed with result 'exit-code'.
-- Subject: Unit failed
-- Defined-By: systemd
-- Support: https://www.debian.org/support
-- 
-- The unit slurmd.service has entered the 'failed' state with result 'exit-code'.
Jul 15 11:16:19 Node02 systemd[1]: Failed to start Slurm node daemon.
-- Subject: A start job for unit slurmd.service has failed
-- Defined-By: systemd
-- Support: https://www.debian.org/support
-- 
-- A start job for unit slurmd.service has finished with a failure.
-- 
-- The job identifier is 1415 and the job result is failed.
Jul 15 11:16:19 Node02 sudo[1646]: pam_unix(sudo:session): session closed for user root
Jul 15 11:17:00 Node02 sudo[1658]:       pi : TTY=pts/0 ; PWD=/home/pi ; USER=root ; COMMAND=/usr/bin/raspi-config
Jul 15 11:17:00 Node02 sudo[1658]: pam_unix(sudo:session): session opened for user root by pi(uid=0)
Jul 15 11:17:01 Node02 CRON[1671]: pam_unix(cron:session): session opened for user root by (uid=0)
Jul 15 11:17:01 Node02 CRON[1675]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Jul 15 11:17:01 Node02 CRON[1671]: pam_unix(cron:session): session closed for user root
Jul 15 11:17:30 Node02 sudo[1658]: pam_unix(sudo:session): session closed for user root
Jul 15 11:22:27 Node02 sudo[1720]:       pi : TTY=pts/0 ; PWD=/home/pi ; USER=root ; COMMAND=/bin/systemctl enable slurmd
Jul 15 11:22:27 Node02 sudo[1720]: pam_unix(sudo:session): session opened for user root by pi(uid=0)
Jul 15 11:22:27 Node02 systemd[1]: Reloading.
Jul 15 11:22:27 Node02 systemd[1]: Reloading.
Jul 15 11:22:28 Node02 systemd[1]: Reloading.
Jul 15 11:22:28 Node02 sudo[1720]: pam_unix(sudo:session): session closed for user root
Jul 15 11:22:30 Node02 sudo[1774]:       pi : TTY=pts/0 ; PWD=/home/pi ; USER=root ; COMMAND=/bin/systemctl start slurmd
Jul 15 11:22:30 Node02 sudo[1774]: pam_unix(sudo:session): session opened for user root by pi(uid=0)
Jul 15 11:22:30 Node02 systemd[1]: Starting Slurm node daemon...
-- Subject: A start job for unit slurmd.service has begun execution
-- Defined-By: systemd
-- Support: https://www.debian.org/support
-- 
-- A start job for unit slurmd.service has begun execution.
-- 
-- The job identifier is 1476.
Jul 15 11:22:30 Node02 systemd[1]: slurmd.service: Control process exited, code=exited, status=1/FAILURE
-- Subject: Unit process exited
-- Defined-By: systemd
-- Support: https://www.debian.org/support
-- 
-- An ExecStart= process belonging to unit slurmd.service has exited.
-- 
-- The process' exit code is 'exited' and its exit status is 1.
Jul 15 11:22:30 Node02 systemd[1]: slurmd.service: Failed with result 'exit-code'.
-- Subject: Unit failed
-- Defined-By: systemd
-- Support: https://www.debian.org/support
-- 
-- The unit slurmd.service has entered the 'failed' state with result 'exit-code'.
Jul 15 11:22:30 Node02 systemd[1]: Failed to start Slurm node daemon.
-- Subject: A start job for unit slurmd.service has failed
-- Defined-By: systemd
-- Support: https://www.debian.org/support
-- 
-- A start job for unit slurmd.service has finished with a failure.
-- 
-- The job identifier is 1476 and the job result is failed.
Jul 15 11:22:30 Node02 sudo[1774]: pam_unix(sudo:session): session closed for user root
Jul 15 11:22:45 Node02 sudo[1790]:       pi : TTY=pts/0 ; PWD=/home/pi ; USER=root ; COMMAND=/bin/journalctl -xe
Jul 15 11:22:45 Node02 sudo[1790]: pam_unix(sudo:session): session opened for user root by pi(uid=0)

ejolson
Posts: 3237
Joined: Tue Mar 18, 2014 11:47 am

Re: SLURM enabled but not starting

Mon Jul 15, 2019 10:34 pm

AZAZAZ wrote:
Thu Jul 11, 2019 6:08 pm
Hi guys,
I'm building a Raspberry Pi cluster for a project and am currently trying to get SLURM working. I can enable it, but when I run "sudo systemctl start slurmd", I get the error message shown in the screenshot. Could this be because I haven't edited the slurm.conf file correctly?
Quite likely you have an error in your slurm.conf file. I'm not sure exactly where systemd puts the log files.

Have you checked in /var/log for anything with a slurm in it? Also, have you installed and configured munge already?
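
For example (the paths and hostname below are just illustrations, assuming the Debian slurm-llnl packages):

Code: Select all

ls /var/log | grep -i slurm
munge -n | unmunge                # test munge locally
munge -n | ssh Node01 unmunge     # test munge against another node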

drllama
Posts: 11
Joined: Thu Jun 27, 2019 2:05 am

Re: SLURM enabled but not starting

Tue Jul 16, 2019 5:13 pm

The repo version of SLURM stores logs in /var/log/slurm-llnl, have a look there.

I literally put my first RPi4 into my bramble using SLURM yesterday. In the end, because I have a mix of RPi3 nodes running Raspbian Stretch and the RPi4 running Raspbian Buster, I ended up building the latest SLURM from source.

That said, logs are your friend.
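
For example, if you're on the packaged paths, something like this should show why the daemon exits:

Code: Select all

sudo tail -n 50 /var/log/slurm-llnl/slurmd.log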

AZAZAZ
Posts: 11
Joined: Fri Jul 05, 2019 6:38 pm

Re: SLURM enabled but not starting

Tue Jul 16, 2019 5:34 pm

Below is the slurm.conf file; could you see if there is anything wrong? Also, I know that munge works on all of the nodes.

Code: Select all

#ControlMachine=Masternode
#ControlAddr=ip.ip.ip
SlurmctldHost=ip.ip.ip

#
AuthType=auth/munge
#CheckpointType=checkpoint/none
CryptoType=crypto/munge
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=999999
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobCheckpointDir=/var/lib/slurm-llnl/checkpoint
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=1
#KillOnBadExit=0
#LaunchType=launch/slurm
#Licenses=foo*4,bar
#MailProg=/usr/bin/mail
#MaxJobCount=5000
#MaxStepCount=40000
#MaxTasksPerNode=128
MpiDefault=none
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/cgroup
#Prolog=
#PrologFlags=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#RebootProgram=
ReturnToService=1
#SallocDefaultCommand=
SlurmctldPidFile=/var/run/slurm-llnl
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm-llnl
SlurmdPort=6818
SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
SlurmUser=slurm
#SlurmdUser=root
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/var/lib/slurm-llnl/slurmctld
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/affinity
TaskPluginParam=Sched
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFS=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
#
# SCHEDULING
#DefMemPerCPU=0
FastSchedule=1
#MaxMemPerCPU=0
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_CPU
#
#
# JOB PRIORITY
#PriorityFlags=
#PriorityType=priority/basic
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
#AccountingStorageHost=
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStoragePort=
AccountingStorageType=accounting_storage/none
#AccountingStorageUser=
AccountingStoreJobComment=YES
ClusterName=cluster
#DebugFlags=
#JobCompHost=
#JobCompLoc=
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/none
#JobCompUser=
#JobContainerType=job_container/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
NodeName=Masternode NodeAddr=ip CPUs=4 State=UNKNOWN
NodeName=Node01 NodeAddr=ip CPUs=4 State=UNKNOWN
NodeName=Node02 NodeAddr=ip CPUs=4 State=UNKNOWN
NodeName=Node03 NodeAddr=ip CPUs=4 State=UNKNOWN
PartitionName=mycluster Nodes=Node[01-03] Default=YES MaxTime=INFINITE State=UP


drllama
Posts: 11
Joined: Thu Jun 27, 2019 2:05 am

Re: SLURM enabled but not starting

Tue Jul 16, 2019 5:38 pm

One thing to double check is: What version of SLURM do you have installed?

Stretch has 16.X, whereas Buster has 18.X. slurm.conf has changed a fair bit between those two releases.

Grab a copy of the sample file in /usr/share/doc/slurm (uh, I think, going from memory here), and use it as the basis for your slurm.conf.
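
For example (package names and doc paths are from memory, so adjust as needed):

Code: Select all

slurmd -V                           # prints the installed SLURM version
dpkg -l | grep -i slurm             # lists the installed SLURM packages
ls /usr/share/doc | grep -i slurm   # shows where the docs/examples landed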

Cheers,
Bruce.

AZAZAZ
Posts: 11
Joined: Fri Jul 05, 2019 6:38 pm

Re: SLURM enabled but not starting

Tue Jul 16, 2019 5:59 pm

I currently have SLURM version 18.08 installed. I looked for the SLURM sample file, but can't find it. Are there any samples online?

drllama
Posts: 11
Joined: Thu Jun 27, 2019 2:05 am

Re: SLURM enabled but not starting

Wed Jul 17, 2019 3:06 am

If you install from the repos, you should find the examples under /usr/share/doc/slurm*

I'd grab a copy for you, except I'm using 19.X now, compiled on my Pis and on my big server, which is acting as the controller.
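
Something along these lines should turn up any packaged example config (exact filenames differ between releases, so treat the results as a starting point):

Code: Select all

find /usr/share/doc -iname '*slurm*conf*'
# any .gz hits can be viewed with zcat (or zless)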

Cheers,
Bruce.

Maphus
Posts: 1
Joined: Sat Jul 20, 2019 4:32 am

Re: SLURM enabled but not starting

Sat Jul 20, 2019 4:39 am

I ran into the same problem with Buster and SLURM version 18.08. After looking at the log file, a solution that worked for me was changing the ProctrackType to proctrack/linuxproc.
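
In other words, the line in slurm.conf ends up like this (the path is the packaged default, so adjust if yours differs; the same edit goes on every node, followed by a restart of slurmd):

Code: Select all

# in /etc/slurm-llnl/slurm.conf
ProctrackType=proctrack/linuxproc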

I hope this helps!
