
Scheduled Maintenance

General Policy

  • Scheduled maintenance always starts on Tuesdays in a normal working week.
    Hence the preceding Monday is a regular workday that can be used to prepare for the maintenance, and the remainder of the week consists of regular workdays that can be used to resolve potential problems after the maintenance.
    The only exception is maintenance at the end of August, which may overlap with the last week of the summer break.
  • Maintenance is scheduled in two rounds with one month in between.
    In the first round, one cluster and its related infra are serviced, followed by 3 weeks of acceptance testing.
    If the acceptance tests are successful, the second cluster and its related infra are serviced in the second round.
    If the infra serviced in the first round fails the acceptance tests, the second round is delayed and rescheduled.
  • Which parts of a redundant setup are serviced in the first round and which in the second round is determined roughly one month before the first round.
  • All clusters have maintenance scheduled twice a year.
  • Maintenance is executed using the checklist below.

UMCG/LifeLines Research Clusters

Infra involved

  • User accounts from 'umcg' and 'll' IDVault entitlements.
  • Calculon and Boxy clusters including UIs, nodes, shared storage and OpenStack cloud servers hosting, among others, the scheduler and proxy VMs.
  • Separate interactive servers: Flexo & Bender.

Schedule

Year | Season | Round | Date   | Infra
2016 | Summer | One   | Aug 23 | Calculon
2016 | Summer | Two   | Sep 20 | Boxy
2016 | Summer | Three | Sep 30 | Calculon: no maintenance, but downtime due to backup-power test @ DUO data center
2017 | Winter | One   | Feb 07 | Calculon
2017 | Winter | Two   | Mar 07 | Boxy
2017 | Winter | Three | Mar 09 | Calculon, Flexo, Bender: network maintenance @ DUO data center
2017 | Summer | One   | Aug 22 | Calculon, Flexo, Bender and Lobby
2017 | Summer | Two   | Sep 20 | Boxy, Flexo, Bender and Foyer (originally planned for Sep 19, but postponed by one day)
2018 | Winter | One   | Feb 06 | Cancelled
2018 | Winter | Two   | Mar 06 | Cancelled
2018 | Summer | One   | Aug 21 | Calculon, Flexo, Bender and Lobby
2018 | Summer | Two   | Sep 18 | Boxy and Foyer

Genome Diagnostics Clusters

Infra involved

  • User accounts from 'GD' IDVault entitlement.
  • Zinc-Finger and Leucine-Zipper clusters including UIs, nodes, schedulers, data sharing servers and shared storage.
  • Separate pre-processing servers: Gattaca*.

Schedule

Note: the first round of scheduled maintenance always coincides with the emergency power tests @ UMCG.

Year | Season | Round | Date   | Infra
2016 | Fall   | Zero  | Sep 30 | Gattaca01 + Leucine-Zipper: no maintenance, but downtime due to backup-power test @ DUO data center
2016 | Fall   | One   | Oct 04 | Gattaca02 + Zinc-Finger
2016 | Fall   | Two   | Nov 08 | Gattaca01 + Leucine-Zipper (originally scheduled for Nov 01, but delayed by one week)
2017 | Winter | Extra | Mar 09 | Gattaca01 + Leucine-Zipper: network maintenance @ DUO data center
2017 | Spring | One   | Apr 04 | Cancelled
2017 | Spring | Two   | May 02 | Cancelled
2017 | Fall   | One   | Oct 03 | Gattaca01 + Leucine-Zipper
2017 | Fall   | Two   | Oct 31 | Gattaca02 + Zinc-Finger
2018 | Spring | One   | Jun 05 | Gattaca01 + Leucine-Zipper (delayed maintenance originally scheduled for Apr 10)
2018 | Spring | Two   | Jul 04 | Gattaca02 + Zinc-Finger (delayed maintenance originally scheduled for May 08)
2018 | Fall   | One   | Oct 02 | T.B.A.
2018 | Fall   | Two   | Oct 30 | T.B.A.
2019 | Spring | One   | Apr 02 | T.B.A.
2019 | Spring | Two   | May 07 | T.B.A.
2019 | Fall   | One   | Oct 01 | T.B.A.
2019 | Fall   | Two   | Oct 29 | T.B.A.

Checklist

  • Create list of all machines that will be serviced during this round of maintenance. Use the infra catalogue to make sure the list is complete.
  • Create a checklist of what needs to be performed for which machines on the RUG CIT Redmine Wiki.
  • Determine which analysis pipelines may be affected by the maintenance and will require a verification or validation experiment.
    Add the required verification/validation experiments to the checklist on the RUG CIT Redmine Wiki.
  • Announce the maintenance on the mailing list.
  • Perform the maintenance.
  • Check if all items of the checklist on the RUG CIT Redmine Wiki have been executed.
  • Check if all machines that can process Slurm jobs work as expected by submitting a CheckEnvironment.sh test job to each compute node that was affected by the maintenance; a minimal submission sketch is shown after this list.
    See Analysis SOP -> FAQs -> Q: How do I know what environment is available to my job on an execution host?
  • Perform the verification/validation experiments as listed in the checklist on the RUG CIT Redmine Wiki, using the corresponding SOP from the docportal for this round of maintenance, fill out the corresponding forms and send them to the product owners.
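
A minimal sketch of the test-job step above, assuming a standard Slurm setup and that CheckEnvironment.sh is a submittable batch script in the working directory; the node names are placeholders and should be replaced with the machine list from the checklist:

    #!/bin/bash
    # Sketch: submit CheckEnvironment.sh to every compute node affected by the
    # maintenance. Replace the placeholder node names with the actual hosts
    # from the maintenance checklist for this round.
    affected_nodes='node01 node02 node03'
    for node in ${affected_nodes}; do
        # --nodelist pins the job to one specific execution host, so every
        # affected node gets its own test job.
        sbatch --nodelist="${node}" \
               --job-name="check-env-${node}" \
               --output="check-env-${node}.out" \
               CheckEnvironment.sh
    done
    # Afterwards, inspect the check-env-*.out files (and squeue/sacct) to verify
    # that each test job completed successfully on its node.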