Revisions of slurm

Dominique Leuenberger (dimstar_suse) accepted request 1220076 from Egbert Eich (eeich) (revision 108)
- Update to version 24.05.4 & fix for CVE-2024-48936.
  * Fix generic int sort functions.
  * Fix user look up using possible unrealized uid in the dbd.
  * `slurmrestd` - Fix regressions that allowed `slurmrestd` to
    be run as SlurmUser when `SlurmUser` was not root.
  * `mpi/pmix` - Fix race conditions with het jobs at step start/end
    which could make srun hang.
  * Fix not showing some `SelectTypeParameters` in `scontrol show
    config`.
  * Avoid assert when dumping certain removed fields in JSON/YAML.
  * Improve how shards are scheduled with affinity in mind.
  * Fix `MaxJobsAccruePU` not being respected when `MaxJobsAccruePA`
    is set in the same QOS.
  * Prevent backfill from planning jobs that use overlapping
    resources for the same time slot if the job's time limit is
    less than `bf_resolution`.
  * Fix memory leak when requesting typed gres and
    `--[cpus|mem]-per-gpu`.
  * Prevent backfill from breaking out due to "system state
    changed" every 30 seconds if reservations use `REPLACE` or
    `REPLACE_DOWN` flags.
  * `slurmrestd` - Make sure that scheduler_unset parameter defaults
    to true even when the following flags are also set:
    `show_duplicates`, `skip_steps`, `disable_truncate_usage_time`,
    `run_away_jobs`, `whole_hetjob`, `disable_whole_hetjob`,
    `disable_wait_for_result`, `usage_time_as_submit_time`,
    `show_batch_script`, and/or `show_job_environment`. Additionally,
    always make sure `show_duplicates` and
    `disable_truncate_usage_time` default to true when the following
    flags are also set: `scheduler_unset`, `scheduled_on_submit`, (forwarded request 1220075 from eeich)
Ana Guerrero (anag+factory) accepted request 1217321 from Egbert Eich (eeich) (revision 107)
- Add %{?sysusers_requires} to slurm-config.
  This fixes issues when building against Slurm. (forwarded request 1217300 from eeich)
Ana Guerrero (anag+factory) accepted request 1208086 from Egbert Eich (eeich) (revision 106)
- Update to version 24.05.3
  * `data_parser/v0.0.40` - Added field descriptions.
  * `slurmrestd` - Avoid creating new slurmdbd connection per request
    to `* /slurm/slurmctld/*/*` endpoints.
  * Fix compilation issue with `switch/hpe_slingshot` plugin.
  * Fix gres per task allocation with threads-per-core.
  * `data_parser/v0.0.41` - Added field descriptions.
  * `slurmrestd` - Change back generated OpenAPI schema for
    `DELETE /slurm/v0.0.40/jobs/` to `RequestBody` instead of using
    parameters for request. `slurmrestd` will continue to accept endpoint
    requests via `RequestBody` or HTTP query.
  * `topology/tree` - Fix issues with switch distance optimization.
  * Fix potential segfault of secondary `slurmctld` when falling back
    to the primary when running with a `JobComp` plugin.
  * Enable `--json`/`--yaml=v0.0.39` options on client commands to
    dump data using data_parser/v0.0.39 instead of outputting nothing.
  * `switch/hpe_slingshot` - Fix issue that could result in a 0 length
    state file.
  * Fix unnecessary message protocol downgrade for unregistered nodes.
  * Fix unnecessarily packing alias addrs when terminating jobs with
    a mix of non-cloud/dynamic nodes and powered down cloud/dynamic
    nodes.
  * `accounting_storage/mysql` - Fix issue when deleting a qos that
    could remove too many commas from the qos and/or delta_qos fields
    of the assoc table.
  * `slurmctld` - Fix memory leak when using RestrictedCoresPerGPU.
  * Fix allowing access to reservations without `MaxStartDelay` set.
  * Fix regression introduced in 24.05.0rc1 breaking
    `srun --send-libs` parsing.
  * Fix slurmd vsize memory leak when using job submission/allocation
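
The `accounting_storage/mysql` entry above concerns removing one id from a comma-delimited field. As a rough sketch of that kind of operation (not Slurm's actual SQL, and assuming a field layout such as `,1,5,12,` with leading and trailing commas), a hypothetical helper `remove_qos` could split on commas and rebuild the field so that only the matching id is dropped:

```python
def remove_qos(field, qos_id):
    """Remove one id from a comma-delimited field such as ',1,5,12,'.

    Hypothetical helper for illustration; the field layout is an assumption.
    """
    target = str(qos_id)
    # Splitting and comparing whole tokens avoids matching '5' inside '15'.
    ids = [part for part in field.split(",") if part and part != target]
    # Rebuild with exactly one comma around each remaining id.
    return ("," + ",".join(ids) + ",") if ids else ""


print(remove_qos(",1,5,12,", 5))
```

Comparing whole tokens rather than substrings is the design choice here: it keeps neighbouring ids and their separators intact.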
Ana Guerrero (anag+factory) accepted request 1151965 from Egbert Eich (eeich) (revision 104)
- Update to version 23.11.03
  * slurmrestd - Reject single http query with multiple path
    requests.
  * Fix launching Singularity v4.x containers with
    `srun --container` by setting .process.terminal to true in
    generated `config.json` when step has pseudoterminal (`--pty`)
    requested.
  * Fix loading in `dynamic/cloud` node jobs after `net_cred`
    expired.
  * Fix cgroup null path error on `slurmd/slurmstepd` tear down.
  * `data_parser/v0.0.40` - Prevent failure if accounting is
    disabled; instead, issue a warning if needed data from the
    database cannot be retrieved.
  * `openapi/slurmctld` - Prevent failure if accounting is disabled.
  * Prevent `slurmscriptd` processing delays from blocking other
    threads in `slurmctld` while trying to launch various scripts.
    This is additional work for a fix in 23.02.6.
  * Fix memory leak when receiving alias addrs from controller.
  * `scontrol` - Accept `scontrol token lifespan=infinite` to
    create tokens that effectively do not expire.
  * Avoid errors when Slurmdb accounting is disabled and `--json` or
    `--yaml` is invoked with CLI commands and `slurmrestd`. Add
    warnings instead of errors when a query would have populated
    data from Slurmdb.
  * Fix `slurmctld` memory leak when running job with
    `--tres-per-task=gres:shard:#`
  * Fix backfill trying to start jobs outside of backfill window.
  * Fix oversubscription on partitions with `PreemptMode=OFF`.
  * Preserve node reason on power up if the node is downed
    or drained. (forwarded request 1150524 from eeich)
Ana Guerrero (anag+factory) accepted request 1141442 from Egbert Eich (eeich) (revision 103)
- Update to 23.11.1 with the following major improvements and fixing
  CVE-2023-49933, CVE-2023-49934, CVE-2023-49935, CVE-2023-49936
  and CVE-2023-49937
  * Substantially overhauled the SlurmDBD association management
    code. For clusters updated to 23.11, account and user
    additions or removals are significantly faster than in prior
    releases.
  * Overhauled `scontrol reconfigure` to prevent configuration
    mistakes from disabling slurmctld and slurmd. Instead, an
    error will be returned, and the running configuration will
    persist. This does require updates to the systemd service
    files to use the `--systemd` option to `slurmctld` and `slurmd`.
  * Added a new internal `auth/cred` plugin - `auth/slurm`. This
    builds off the prior `auth/jwt` model, and permits operation
    of the `slurmdbd` and `slurmctld` without access to full
    directory information with a suitable configuration.
  * Added a new `--external-launcher` option to `srun`, which is
    automatically set by common MPI launcher implementations and
    ensures processes using those non-srun launchers have full
    access to all resources allocated on each node.
  * Reworked the dynamic/cloud modes of operation to allow for
    "fanout" - where Slurm communication can be automatically
    offloaded to compute nodes for increased cluster scalability.
  * Overhauled and extended the Reservation subsystem to allow
    for most of the same resource requirements as are placed on
    the job. Notably, this permits reservations to now reserve
    GRES directly.
- Details of changes:
  * Fix `scontrol update job=... TimeLimit+=/-=` when used with a
    raw JobId of job array element.
Dominique Leuenberger (dimstar_suse) accepted request 1137045 from Egbert Eich (eeich) (revision 102)
- Update to 23.02.6 to fix (CVE-2023-49933 - bsc#1218046, CVE-2023-49935 -
  bsc#1218049, CVE-2023-49936 - bsc#1218050, CVE-2023-49937 - bsc#1218051,
  CVE-2023-49938 - bsc#1218053)
  * Security Fixes:
    + Add `JobAcctGatherParams=DisableGPUAcct` to disable gpu accounting.
    + `acct_gather_energy/ipmi` - Improve logging of DCMI issues.
    + `gpu/oneapi` - Add support for new env vars `ZE_FLAT_DEVICE_HIERARCHY`
      and `ZE_ENABLE_PCI_ID_DEVICE_ORDER`.
    + `data_parser/v0.0.39` - skip empty string when parsing QOS ids.
    + Remove error message from `assoc_mgr_update_assocs` when purposefully
      resetting the default QOS.
  * Bug Fixes:
    + `libslurm_nss` - Avoid causing glibc to assert due to an unexpected
      return from slurm_nss due to an error during lookup.
    + Fix job requests with `--tres-per-task` sometimes resulting in bad
      allocations that cannot run subsequent job steps.
    + Fix issue with `slurmd` where `srun` fails to be warned when a node
      prolog script runs beyond `MsgTimeout` set in `slurm.conf`.
    + `gres/shard` - Fix plugin functions to have matching parameter orders.
    + `gpu/nvml` - Fix issue that resulted in the wrong MIG devices being
      constrained to a job.
    + `gpu/nvml` - Fix linking issue with MIGs that prevented multiple MIGs
      being used in a single job for certain MIG configurations.
    + Fix file descriptor leak in slurmd when using `acct_gather_energy/ipmi`
      with DCMI devices.
    + `sview` - avoid crash when job has a node list string > 49 characters.
    + Prevent `slurmctld` crash during reconfigure when packing job start
      messages.
    + Preserve reason uid on reconfig.
    + Update node reason with updated `INVAL` state reason if different from (forwarded request 1136624 from eeich)
Ana Guerrero (anag+factory) accepted request 1130097 from Egbert Eich (eeich) (revision 101)
- Add missing service file for slurmrestd (boo#1217711). (forwarded request 1130096 from eeich)
Ana Guerrero (anag+factory) accepted request 1129192 from Factory Maintainer (factory-maintainer) (revision 100)
Automatic submission by obs-autosubmit
Ana Guerrero (anag+factory) accepted request 1123596 from Egbert Eich (eeich) (revision 99)
- Add missing dependencies to slurm-config to plugins package.
  These should help to tie down the slurm version and help to avoid
  a package mix (bsc#1216869). (forwarded request 1123595 from eeich)
Dominique Leuenberger (dimstar_suse) accepted request 1121548 from Factory Maintainer (factory-maintainer) (revision 98)
Automatic submission by obs-autosubmit
Ana Guerrero (anag+factory) accepted request 1118220 from Egbert Eich (eeich) (revision 97)
- update to 23.02.6 to fix (CVE-2023-41914, bsc#1216207)
Ana Guerrero (anag+factory) accepted request 1117163 from Egbert Eich (eeich) (revision 96)
- update to 23.02.6 to fix (CVE-2023-41914)
  * Removed Fix-test-32.8.patch as fixed upstream
  * Bug Fixes:
    + Fix `CpusPerTres=` not updatable with scontrol update
    + Fix unintentional gres removal when validating the gres job state.
    + Fix `--without-hpe-slingshot` configure option.
    + Fix cgroup v2 memory calculations when transparent huge pages are used.
    + Fix parsing of `sgather --timeout` option.
    + Fix regression from 22.05.0 that caused `srun --cpu-bind "=verbose"`
      and `"=v"` options to give different CPU bind masks.
    + Fix "_find_node_record: lookup failure for node" error message appearing
      for all dynamic nodes during reconfigure.
    + Avoid segfault if loading serializer plugin fails.
    + `slurmrestd` - Correct OpenAPI format for `GET /slurm/v0.0.39/licenses`.
    + `slurmrestd` - Correct OpenAPI format for
      `GET /slurm/v0.0.39/job/{job_id}`.
    + `slurmrestd` - Change format to multiple fields in
      `GET /slurmdb/v0.0.39/associations` and `GET /slurmdb/v0.0.39/qos` to
      handle infinite and unset states.
    + When a node fails in a job with `--no-kill`, preserve the extern step on the
      remaining nodes to avoid breaking features that rely on the extern step
      such as `pam_slurm_adopt`, `x11`, and `job_container/tmpfs`.
    + `auth/jwt` - Ignore `x5c` field in JWKS files.
    + `auth/jwt` - Treat 'alg' field as optional in JWKS files.
    + Allow job_desc.selinux_context to be read from the job_submit.lua script.
    + Skip check in slurmstepd that causes a large number of errors in the
      munge log: "Unauthorized credential for client UID=0 GID=0".
      This error will still appear on `slurmd`/`slurmctld`/`slurmdbd` start up
      and is not a cause for concern.
    + `slurmctld` - Allow startup with zero partitions.
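
The two `auth/jwt` entries above describe tolerant handling of JWKS files. A minimal sketch of that idea, not Slurm's implementation: the function name `usable_keys` is hypothetical, and the fallback algorithm `RS256` is an assumption made only so the example has a default.

```python
import json


def usable_keys(jwks_text, default_alg="RS256"):
    """Parse a JWKS document, ignoring 'x5c' and tolerating a missing 'alg'.

    Hypothetical helper; default_alg is an assumption for illustration.
    """
    keys = []
    for entry in json.loads(jwks_text).get("keys", []):
        # Drop the certificate chain field; it is not needed here.
        key = {name: value for name, value in entry.items() if name != "x5c"}
        # 'alg' is optional, so fall back to the assumed default.
        key.setdefault("alg", default_alg)
        keys.append(key)
    return keys


sample = '{"keys": [{"kty": "RSA", "kid": "k1", "x5c": ["MIIB"]}]}'
print(usable_keys(sample))
```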
Dominique Leuenberger (dimstar_suse) accepted request 1111943 from Egbert Eich (eeich) (revision 95)
- Updated to version 23.02.5 with the following changes:
  * Bug Fixes:
    + Revert a change in 23.02 where `SLURM_NTASKS` was no longer set in the
      job's environment when `--ntasks-per-node` was requested.
      The method by which it is being set, however, is different and should be more
      accurate in more situations.
    + Change pmi2 plugin to honor the `SrunPortRange` option. This matches the
      new behavior of the pmix plugin in 23.02.0. Note that neither of these
      plugins makes use of the `MpiParams=ports=` option, and previously
      were only limited by the system's ephemeral port range.
    + Fix regression in 23.02.2 that caused slurmctld -R to crash on startup if
      a node features plugin is configured.
    + Fix and prevent reoccurring reservations from overlapping.
    + `job_container/tmpfs` - Avoid attempts to share BasePath between nodes.
    + With `CR_Cpu_Memory`, fix node selection for jobs that request gres and
      `--mem-per-cpu`.
    + Fix a regression from 22.05.7 in which some jobs were allocated too few
      nodes, thus overcommitting cpus to some tasks.
    + Fix a job being stuck in the completing state if the job ends while the
      primary controller is down or unresponsive and the backup controller has
      not yet taken over.
    + Fix `slurmctld` segfault when a node registers with a configured
      `CpuSpecList` while `slurmctld` configuration has the node without
      `CpuSpecList`.
    + Fix cloud nodes getting stuck in `POWERED_DOWN+NO_RESPOND` state after
      not registering by `ResumeTimeout`.
    + `slurmstepd` - Avoid cleanup of `config.json-less` containers spooldir
      getting skipped.
    + Fix scontrol segfault when 'completing' command requested repeatedly in
      interactive mode.
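
Because the `SLURM_NTASKS` entry above notes that the variable was not always set in 23.02, a job script may want to read it defensively. A small sketch under that assumption follows; `ntasks_from_env` and its default of 1 are hypothetical choices for illustration, not Slurm behavior.

```python
import os


def ntasks_from_env(environ=None, default=1):
    """Return SLURM_NTASKS as an int, or a default when it is not set.

    Hypothetical helper; the default value is an assumption.
    """
    environ = os.environ if environ is None else environ
    value = environ.get("SLURM_NTASKS")
    return int(value) if value is not None else default


print(ntasks_from_env({"SLURM_NTASKS": "8"}))
```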
Ana Guerrero (anag+factory) accepted request 1110422 from Egbert Eich (eeich) (revision 94)
- Create a macro for upgrade dependency to ensure uniform handling. (forwarded request 1110421 from eeich)
Ana Guerrero (anag+factory) accepted request 1110259 from Egbert Eich (eeich) (revision 93)
- Updated to 23.02.4 with the following changes:
  * Bug Fixes:
    + Fix main scheduler loop not starting after a failover to backup
      controller. Avoid slurmctld segfault when specifying
      `AccountingStorageExternalHost` (bsc#1214983).
    + Fix sbatch return code when `--wait` is requested on a job array.
    + Fix collected `GPUUtilization` values for `acct_gather_profile` plugins.
    + Fix `slurmrestd` handling of job hold/release operations.
    + Fix step running indefinitely when slurmctld takes more than
      `MessageTimeout` to respond. Now, `slurmctld` will cancel the step when
      detected, preventing following steps from getting stuck waiting for
      resources to be released.
    + Fix regression to make `job_desc.min_cpus` accurate again in `job_submit`
      when requesting a job with `--ntasks-per-node`.
    + Fix handling of `ArrayTaskThrottle` in backfill.
    + Fix regression in 23.02.2 when checking gres state on `slurmctld`
      startup or reconfigure. Gres changes in the configuration were not
      updated on slurmctld startup. On startup or reconfigure, these messages
      were present in the log: `"error: Attempt to change gres/gpu Count"`.
    + Fix potential double count of gres when dealing with limits.
    + Fix `slurmstepd` segfault when `ContainerPath` is not set in `oci.conf`
    + Fixed an issue where jobs requesting licenses were incorrectly rejected.
    + `scrontab` - Fix cutting off the final character of quoted variables.
    + `smail` - Fix issues where e-mails at job completion were not being sent.
    + `scontrol/slurmctld` - fix comma parsing when updating a reservation's
      nodes.
    + Fix `--gpu-bind=single` binding tasks to wrong gpus, leading to some gpus
      having more tasks than they should and other gpus being unused.
    + Fix regression in 23.02 that causes slurmstepd to crash when `srun`
      requests more than `TreeWidth` nodes in a step and uses the pmi2 or
Ana Guerrero (anag+factory) accepted request 1109308 from Egbert Eich (eeich) (revision 92)
- Fixes since 23.02.03:
  Highlights:
  * Fix main scheduler loop not starting after a failover to backup controller.
  * Avoid slurmctld segfault when specifying `AccountingStorageExternalHost`
    (bsc#1214983).
  Other:
  * Fix sbatch return code when `--wait` is requested on a job array.
  * Fix collected `GPUUtilization` values for `acct_gather_profile` plugins.
  * Fix `slurmrestd` handling of job hold/release operations.
  * Make spank `S_JOB_ARGV` item value hold the requested command `argv`
    instead of the `srun --bcast` value when `--bcast` requested (only in local
    context).
  * Fix step running indefinitely when slurmctld takes more than
    `MessageTimeout` to respond. Now, slurmctld will cancel the step when
    detected, preventing following steps from getting stuck waiting for
    resources to be released.
  * Fix regression to make `job_desc.min_cpus` accurate again in job_submit when
    requesting a job with `--ntasks-per-node`.
  * Fix handling of `ArrayTaskThrottle` in backfill.
  * Fix regression in 23.02.2 when checking gres state on `slurmctld` startup or
    reconfigure. Gres changes in the configuration were not updated on slurmctld
    startup. On startup or reconfigure, these messages were present in the log:
    `"error: Attempt to change gres/gpu Count"`.
  * Fix potential double count of gres when dealing with limits.
  * Fix slurmstepd segfault when ContainerPath is not set in `oci.conf`
  * Fixed an issue where jobs requesting licenses were incorrectly rejected.
  * `scrontab` - Fix cutting off the final character of quoted variables.
  * `smail` - Fix issues where e-mails at job completion were not being sent.
  * `scontrol/slurmctld` - fix comma parsing when updating a reservation's
    nodes.
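
The `scrontab` entry above mentions the final character of quoted variables being cut off. As an illustration of quote stripping that avoids this kind of off-by-one error (a sketch only, not scrontab's parser; `unquote` is a hypothetical name), the surrounding quotes are removed only when both ends match:

```python
def unquote(value):
    """Strip one pair of matching surrounding quotes, leaving the rest intact.

    Hypothetical helper for illustration.
    """
    if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
        return value[1:-1]
    return value


print(unquote('"/usr/local/bin"'))
```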
Ana Guerrero (anag+factory) accepted request 1109029 from Christian Goll (mslacken) (revision 91)
- updated to 23.02.04, which includes the following changes:
  * fixes for the main scheduler loop not starting on the backup controller
    after a failover event, a segfault when attempting to use
    AccountingStorageExternalHost, and an issue where steps could continue
    running indefinitely if slurmctld takes too long to respond (bsc#1214983)
  * includes a fix for potential slurmctld crashes when the backup slurmctld
    takes over.
  * This also fixes some issues when using older versions of the command line
    tools with a 23.02 controller.
  * srun/sbatch/salloc - In order to support user namespaces, process user and
    group ids are no longer used unless explicitly requested as an argument and
    are left as nobody(99) by default. Any cli_filters or SPANK plugins need to
    ignore any uid or gid that equals SLURM_AUTH_NOBODY (99). User and group ids
    are now resolved by the active auth plugin. To determine the actual job uid
    or gid you should use the RESPONSE_RESOURCE_ALLOCATION RPC.
- removed Fix-test-3.13.patch as fixed upstream
- removed Fix-test-38.11.patch as test changed upstream (forwarded request 1109009 from mslacken)
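
The `SLURM_AUTH_NOBODY` note above can be sketched as a simple guard. Real cli_filter and SPANK plugins are not written in Python, so this only illustrates the logic; `resolved_id` is a hypothetical helper, and the value 99 is taken from the changelog entry.

```python
SLURM_AUTH_NOBODY = 99  # value quoted in the changelog entry above


def resolved_id(raw_id):
    """Return raw_id if it names a real user or group, else None.

    Hypothetical helper: an id equal to SLURM_AUTH_NOBODY is treated as
    "not resolved yet" and ignored rather than used as a real account.
    """
    if raw_id == SLURM_AUTH_NOBODY:
        return None
    return raw_id


print(resolved_id(1000), resolved_id(SLURM_AUTH_NOBODY))
```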
Dominique Leuenberger (dimstar_suse) accepted request 1083466 from Egbert Eich (eeich) (revision 89)
- Web-configurator: changed presets to SUSE defaults.
- If %_restart_on_update is no longer defined, replace it with our own
  macro.
- Marked slurm-openlava, slurm-seff and slurm-sjstat noarch.
- rpmlint:
  * dropped some rpmlint filters which are no longer relevant.
  * added/refreshed filters. For details, see rpmlintrc.
- Remove workaround to fix the restart issue in a Slurm package
  described in bsc#1088693.
  The Slurm version in this package was 16.05. Any attempt to
  directly migrate to the current version is bound to fail
  anyway.
- Now require slurm-munge if munge authentication is installed.