5 November 2014 henry2.
The data center where the cluster is located lost cooling. All compute nodes have been shut down. Several disk array controllers are in failed states, and some arrays shut down automatically while file systems were active. No estimate of recovery time is available at this point.
Efforts continue to recover the file systems. Once /gpfs_share, /share, /share2, and /share3 have been recovered, the compute nodes will begin to be brought back online. These are large file systems, and it is not known how much longer checking them will take.
Recovery efforts suffered a setback this morning when cooling to the data center was interrupted for a second time. We were close to bringing compute nodes online when the cooling failure occurred. File systems are now being re-scanned, with the goal of having GPFS file systems and compute nodes available Monday 11/10.
Checks of the GPFS file systems are complete and compute nodes have mostly been restarted. Earlier issues with some login nodes and web forms for new project and account requests have been resolved. LSF queues are active and hosts are open to accept jobs.
We sincerely apologize for this extended outage.
3 November 2014 LSF
Subsequent to the LSF upgrade, MPICH2 hydra binding is generating errors like:
    TS: sendRegisterTask: nb_tcpConnect from (nxxx-x) failed. Bad host name
As a temporary workaround, replace the mpiexec_hydra command line in LSF job scripts with:
    set tasks = `mkmach.pl mach`
    mpihydra -n $tasks -f mach -bootstrap-exec blaunch ./mpi-hello
mkmach.pl is in the default path (in the LSF 8.3 etc directory). Replace ./mpi-hello with your executable and the path to it. In the example above, the path "./" refers to the working directory. If your executable mpi-hello is not in the working directory, add one more line that resolves the full path to mpi-hello into myexepath, and use $myexepath in place of ./mpi-hello:
    set myexepath = `which mpi-hello`
    set tasks = `mkmach.pl mach`
    mpihydra -n $tasks -f mach -bootstrap-exec blaunch $myexepath
An additional update will be provided when a longer-term solution is available.
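The workaround above hands mpihydra a machine file named mach and a task count. How mkmach.pl produces them is not described in this notice. The following is only a hypothetical Python sketch of one way such a helper could work, assuming the allocated hosts are taken from LSF's LSB_HOSTS environment variable (one hostname per allocated slot) and that the launcher accepts Hydra-style host:count lines. The function name write_machine_file is invented for illustration.

```python
import os
from collections import Counter


def write_machine_file(path, hosts_env=None):
    """Write a Hydra-style host file and return the total task count.

    hosts_env is a space-separated host list with one entry per slot.
    If omitted, it is read from LSB_HOSTS (assumed to be set by LSF
    inside a running job).
    """
    if hosts_env is None:
        hosts_env = os.environ.get("LSB_HOSTS", "")
    slots = hosts_env.split()
    # Counter keeps first-seen order (dict ordering, Python 3.7+).
    counts = Counter(slots)
    with open(path, "w") as fh:
        for host, n in counts.items():
            fh.write(f"{host}:{n}\n")
    return len(slots)
```

Called as write_machine_file("mach") inside a job, this would leave a file mach in the working directory and return the number of tasks to pass to -n. It is a stand-in for understanding the workaround, not a replacement for mkmach.pl.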
2 November 2014 LSF
Because of ongoing errors following the Oct 27 LSF
issues, and because LSF version 7.06 was no
longer supported, LSF was updated to version 8.3.
27 October 2014 LSF
The LSF master batch daemon experienced a problem restarting,
and no b* commands were working on the cluster.
At approximately 11:30pm the LSF master batch daemon was restarted
and LSF service resumed.
15 October 2014 Ansys default version
Default Ansys version (obtained using the add ansys command) has been changed to 14.5 (from 13.0). Version 13.0 can still be accessed using the command add ansys-130.
28 August 2014 /share2
Currently write operations on /share2 file set are
not completing. Issue is being investigated.
5 August 2014 LSF
LSF is currently not working on the cluster due to side effects from earlier file system problems. Working to resolve these issues.
9 July 2014 License server issues
The license server for many of the licenses used on
cluster is having issues and licenses are not available.
This likely will impact many areas of cluster operation
including LSF. Issue is being investigated and the license
server will be rebooted.
13 June 2014 henry2 file system access
During the evening of Saturday June 21 ComTech will be
rebooting the network switches that connect HPC nodes
and storage servers to apply a required software update.
This will cause an approximate 5 minute network interruption for each switch. Due to the length of the interruption it is likely that file I/O attempted during the interruption will time out and fail.
Please plan job submissions with expectation that file I/O performed Saturday (June 21) evening will likely fail.
30 May 2014 /share* directories
The /share* directories are going to be moved to new storage.
To enable this move, the /share* directories will be changed to a read-only state, the data synced between current storage and new storage, and then checksums computed to ensure complete and correct transfer. The /share* file systems will be in a read-only state one at a time, with the following planned schedule:
- /share read only starting noon Friday June 6
- /share2 read only starting noon Friday June 13
- /share3 read only starting noon Friday June 20
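The checksum step mentioned in this entry could look roughly like the following Python sketch. It is not the procedure used for the /share* migration. It only illustrates comparing two directory trees file by file after a sync. The function names and the choice of SHA-256 are assumptions made for the example.

```python
import hashlib
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def tree_checksums(root):
    """Map each regular file's path (relative to root) to its digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): file_sha256(p)
        for p in root.rglob("*")
        if p.is_file()
    }


def compare_trees(old_root, new_root):
    """Return sorted relative paths that are missing or differ."""
    old = tree_checksums(old_root)
    new = tree_checksums(new_root)
    return sorted(
        name for name in old.keys() | new.keys()
        if old.get(name) != new.get(name)
    )
```

An empty list from compare_trees(old_root, new_root) means every file is present on both sides with identical contents. Keeping the source read only while this runs matters, because a file written mid-comparison would show up as a false mismatch.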
4 April 2014 /ncsu/volume1
Quotas for users and groups using /ncsu/volume1 have been enabled as a
temporary, emergency measure. The tape library that supports this file
system has essentially reached its current capacity limit.
This tape library is also used for backup of the contents of the SONAS storage system that is being replaced. As file sets are migrated to the new storage system, the backup will move from on campus to off campus, freeing capacity in the tape library. As capacity becomes available, the quotas will be eased.
For the moment, the quotas will effectively prevent further movement of data to /ncsu/volume1. We regret this inconvenience and will work to free capacity in the tape library as quickly as feasible.
17 February 2014 henry2 cluster
Access to the cluster is currently unavailable. Login attempts hang
after entering password. Working to identify and resolve this issue.
13 January 2014 ANSYS and CFX default versions
The versions of ANSYS and CFX that are accessed using the
add cfx command have been updated from version 13 to version 14.5.
31 December 2013 sam cluster
HPC service running from sam cluster at MCNC will be
retired effective 31 December 2013.
2 December 2013 HPC Website
Web site URL changed from www.ncsu.edu/itd/hpc to www.ncsu.edu/hpc, causing many pages with embedded URLs to break. Please report any broken pages that you encounter.
8 November 2013 /ncsu/volume1
/ncsu/volume1 is currently unavailable. Issue is
7 October 2013 SONAS
8 October update - As of ~10:30am the file sets provided by SONAS are available again from the HPC cluster.
SONAS storage system had multiple concurrent hard drive failures within a single array. This caused the system to enter a data protection mode that is not allowing access from the henry2 cluster.
Initial attempts to rebuild the array by mapping in spare drives have failed. Work is currently underway to copy the data from the faulted array to other storage within the system. So far the vendor has reported no evidence of data loss.
There is no current estimate of when system will be restored. Currently we are not trying to remount any file sets to allow the full capabilities of the system to be devoted to restoration of the faulted array.
The following file systems that are served by this storage system are currently unavailable [note that file sets named *_data have daily backups whereas file sets named *share* are scratch file sets without backup]:
/share /share2 /share3 /kelley_data /abdelkhalik_data /dgrp_share /mobyle_share /edwards_data /edwards_data2 /xie_data /gubbins_data /stormtrack_data /zhang_share /zhang_data /zhang_data7 /zhang_data8 /meskhidze_data /zhang_data9 /kim_data /whetten_data /zhang_data10
12 September 2013 henry2 /home
Disk errors have again caused /home file system to
/home restored temporarily from backup to newer hardware. Previous /home is available read-only as /old_home
A future outage will be scheduled to move to a more resilient /home that can be recovered more quickly after failures.
9 September 2013 henry2 - /share3 file system.
The /share3 file system
will be "read only" from around 8 AM, Monday September 9 until
sometime on Wednesday 11 September. It is being migrated to newer
disks. /share and /share2 (as well as other file systems) will
still be available for writing.
6 September 2013 henry2 /home
Disk errors have caused the /home file system on
henry2 cluster to automatically change to read-only.
The storage server for /home is being rebooted (3pm EDT).
New LSF jobs will not work until /home is restored to read-write state.
22 August 2013 henry2 - some nodes unavailable
Power outage affected campus chilled water loop.
Due to temperature rise in data center some nodes automatically
shut down to avoid overheating. As of 2:30 PM not all have been
rebooted. Some InfiniBand and 10GigE partner queues are impacted.
9 August 2013 henry2 - LSF
LSF was not working and could not be restarted, so the
master LSF server had to be rebooted. This server also
supports many of the license servers for henry2 cluster.
Some license servers did not restart correctly after the reboot
and are being restarted manually.
29 July 2013 share file system.
The /share file system
will be "read only" from around 8 AM, Monday July 29 until
sometime on Tuesday 30 July. It is being migrated to newer
disks. /share2 and /share3 (as well as other file systems) will
still be available for writing.
/share is not available read/write from the newer storage system
2 August update - /share is writable.
27 September 2012 sam cluster
Will be working on migrating to new LSF version
on sam cluster between now and mid-October.
There may be occasional interruptions to scheduling
of new jobs. Running jobs should not be impacted.
24 June 2012 henry2 /share2
/share2 (and other file systems provided by SONAS storage system)
will be unavailable Sunday 24 June for hardware update.
On HPC henry2 cluster affected file systems are
/share2 /kelley_data /zhang_share /zhang_data /abdelkhalik_data /dgrp_share /mobyle_share /edwards_data /xie_data /gubbins_data /stormtrack_data /zhang_data7 /zhang_data8 /meskhidze_data /zhang_data9
22 June 2012 henry2 /gpfs_share
Update 30 June
Extensive efforts to recover the contents of /gpfs_share have been unsuccessful. While it is possible to list directory contents for most directories all attempts to access file contents have failed. We deeply regret the harm that loss of this file system's contents will cause.
We are proceeding with creation of a new /gpfs_share using new hardware. However, this event was not hardware related; the current /gpfs_share hardware is working normally. This event was due to corruption of the file system contents. We will continue to work with IBM to attempt to identify how this data corruption occurred.
/gpfs_share file system has been seriously damaged. Efforts are underway to determine if any data can be recovered. /gpfs_share is a scratch file system, so there is no backup.
Two login nodes experienced exceptionally heavy load and became unresponsive. These login nodes had to be forcibly shut down and restarted. Following restart two directories on /gpfs_share displayed signs of serious damage. A file system check was run on /gpfs_share - the check failed to complete, but did report indications of additional data corruption.
/gpfs_share is unavailable while we work to determine if any data is recoverable.
3 June 2012 henry2 /share2
/share2 (and other file systems provided by SONAS storage system)
will be unavailable Sunday 3 June for system update.
17 May 2012 R
Default version of R on henry2 cluster has been updated from version 2.11
to 2.14 (version accessed using the
add R command). Version 2.11 can still be accessed using
2-3 May 2012 henry2 GPFS file systems
To prepare for moving /gpfs_share to new hardware
the version of GPFS on file server nodes will be
updated to version 3.3. All GPFS file systems will
be unavailable Wednesday May 2 and Thursday May 3.
May 3 update - some GPFS file systems are online again. Files can be accessed, but jobs should not be started from these file systems yet. There will be another disruption when /gpfs_share update completes to finalize the update to version 3.3
9 April 2012 login03
ssh key for host login03 was updated today
new DSA public key is
14 February 2012 Matlab
Default version of Matlab on henry2 cluster has been updated from 2009b
to 2011a (version accessed using the add matlab command). 2009b
can still be accessed using add matlab2009b.
31 October 2011 CFX
Default version of CFX (now also known as ANSYS CFD)
has been updated to v12.1 (version accessed using
add cfx command).
27 September 2011 henry2 /home and /usr/local
A drive failed in the disk array that holds /home and
/usr/local file systems.
While the array was rebuilding a second drive failure occurred resulting in loss of both file systems.
/home and /usr/local are being restored from backup. Due to size of these file systems this may take a day or two to complete.
Because LSF resides on the /usr/local file system and requires /home to submit jobs, the cluster is essentially unusable until the file systems are restored.
Logins have been disabled to allow file system restoration to proceed as rapidly as possible.
morning 28 September 2011 Update
Restore of /usr/local is complete and work is underway to bring LSF back up and to restart various affected license servers.
evening 28 September 2011 Update
Restore of /home is about 70% complete. We anticipate that restore will finish sometime tomorrow.
evening 29 September 2011 Update
As of ~8pm /home restore is complete and ssh to login nodes is enabled. LSF queues remain inactive - jobs can be submitted but are not being started. Expect LSF to be ready to schedule new jobs by Friday morning.
30 September 2011 Update
All LSF queues were returned to active state by around noon on 30 September.
We sincerely regret the extensive inconvenience that resulted from this event.
2 September 2011 Matlab
Symbolic link /usr/local/apps/matlab was changed
to point to /usr/local/apps/matlab2009b. This is
consistent with the default version of matlab
accessed using the 'add matlab' command.
License file for Matlab was also updated.
16 August 2011 SONAS Maintenance
In preparation for Fall Semester there will be maintenance
performed on the SONAS storage system from 6pm to 8pm
to enable an additional file sharing protocol. NFS access to SONAS file
systems may be unavailable during this time. These file
systems are:
/share2 /kelley_data /abdelkhalik_data /dgrp_share /mobyle_share /edwards_data /xie_data /gubbins_data /stormtrack_data /zhang_data /zhang_data7 /zhang_data8 /meskhidze_data
16 August 2011 DL Poly
DL Poly has been updated to version 4.01
This version replaces previous versions 2 and 3.
27 June 2011 /share2
/share2 will be unavailable from 8am - 8pm on Monday
June 27. Storage controllers will be replaced in the
storage system providing /share2 file system. Due to unanticipated
complications during the controller replacement /share2
remained unavailable until about noon on Tuesday June 28.
17 June 2011 LSF
LSF files and executables reside on /usr/local. Following
unavailability of /usr/local LSF was not
functioning correctly. Restart of LSF was completed about 3:30pm.
17 June 2011 /usr/local
File server for /usr/local on henry2 is not currently
reachable - so /usr/local file system is not available
to the cluster. File server had run out of memory. It
has been rebooted and /usr/local file system is again available.
10 June 2011 /share3
File server for /share3 file system will be replaced
with newer, higher capability server. Expect /share3
to be unavailable from 08:00-13:00 Friday 10 June.
25 May 2011 /share
File server for /share file system was replaced
with newer, higher capability server.
29 March 2011 /share3
A file system error occurred on /share3. The file system
was automatically remounted read-only. File system check is
being run to find and correct file system problems.
7 February 2011 henry2 32-bit login nodes
All henry2 cluster nodes will be running 64-bit Linux
by 7 February 2011. The login nodes reachable from
login.hpc.ncsu.edu will be reloaded with 64-bit Linux
the morning of February 7. These login nodes will then
be running the same Linux as the login nodes reached from
28 January 2011 matlab
Default version of Matlab has been changed from 7 to 2009b
(version accessed using 'add matlab' command). Version 7
can still be accessed using 'add matlab7'.
10 January 2011 henry2 login node IP addresses
IP addresses for henry2 login nodes will change
starting early morning. There may be issues connecting
as DNS servers are updated with the new addresses.
9 December 2010 /share2
Storage system that houses /share2 file system experienced
problems. Vendor checked the system and /share2 is now
available again on all cluster nodes.
Please send report to firstname.lastname@example.org if you continue to see issues with /share2
3 November 2010 Amber 10
Latest patches were applied to Amber 10 and it has been recompiled.
14 October 2010 Espresso
Default version of Espresso has been changed from version 3.0 to version 4.2.1 (version accessed when using 'add espresso' command). Version 3.0 can be accessed using 'add espresso-3.0' command.
24 September 2010 Abaqus
Default version of Abaqus has been changed from version 6.8 to version 6.9-EF (version accessed when using 'abaqus' command). Version 6.8 can be accessed using 'abq684' command.
22 September 2010 Amber
Default version of Amber has been changed from version 8 to version 10
(version accessed when using 'add amber' command). Version 8 can be
accessed using 'add amber8'.
22 September 2010 henry2
Network access to henry2 cluster was lost for ~30 minutes
about 9:40am. Access has been restored, cause is under investigation.
23 June 2010 henry2
LSF host definitions were modified so that bhosts
command now shows a condensed list of resources - summarized
by blade chassis. To see listing for all nodes use the
command bhosts -X
23 June 2010 /share2
/share2 file system will be unavailable from approximately 9am
until approximately 5pm due to a hardware upgrade.
21 June 2010 Default PGI Compilers
Using the commands add pgi or add pgi64
will now set up your environment to use version 10.5 PGI
compilers and MPICH MPI libraries as provided by PGI.
Since the update to Red Hat Enterprise Linux 5, some MPI codes, particularly codes using 32 or more processors, have encountered P4 net_send errors. P4 is the underlying communications software used by MPICH. Neither P4 nor MPICH has been maintained for some time. MPICH development has moved to MPICH2, which uses other communications software.
MPICH2 libraries using hydra for process management are available and have been effective at resolving the net_send errors. Use the command add pgi64_hydra to configure your environment to use the PGI 64-bit compiler with MPICH2 libraries. Note that currently only the 64-bit version is supported (so it needs to be used from login64.hpc.ncsu.edu).
When using MPICH2 replace mpiexec with mpiexec_hydra to run MPI jobs. mpiexec_hydra will start the MPI job tasks under control of LSF scheduler.
25 May 2010 PGI Compilers
Portland Group Compilers version 10.5 are now available
on henry2 cluster.
Use command add pgi10-5 to configure your
environment to use the new version.
19 May 2010 henry2
Nearly all nodes have been reloaded and all queues are activated.
Please report any problems encountered to email@example.com
17 May 2010 henry2
Reloading of henry2 compute nodes with new Linux version
continues. Several hundred nodes are now available and
queues are beginning to be reactivated.
Please report any problems encountered to firstname.lastname@example.org
16 May 2010 henry2
Due to work at the Sullivan substation power will be out to much of main
campus from 7am - 1pm. This power outage will affect the chiller plant that
provides significant amount of cooling for the data center where the henry2
cluster is located.
To avoid overheating the cluster nodes, henry2 cluster will be shut down for power outage. Access to some queues will begin to be restricted well ahead of the actual outage to allow jobs to complete prior to nodes being powered down.
We are planning to take this opportunity to update Linux version running on henry2 from RHEL4 to RHEL5. Sam cluster is already running RHEL5 and no incompatibilities have been observed. So, we expect the update to have minimal negative impacts. Primary impact will be that restarting the cluster following the power outage will be slower as each node will need to reload rather than just reboot.
15 May 2010 henry2
About 4pm a data center test in preparation for the planned power
outage on May 16 resulted in high temperatures around the henry2
cluster. A large number of henry2 nodes automatically shut down
due to the heat. Many running LSF jobs were lost.
We very much regret this event and the unexpected loss of work and are investigating procedures to prevent recurrence of this kind of event.
3 May 2010 /share2
A new shared, scratch file system, /share2, is now available
for use by all users. Like all /share file systems this space
is not backed up and may be subject to periodic purge.
16 February 2010 /ncsu/volume1 and /ncsu/volume2
Both /ncsu/volume1 and /ncsu/volume2 are again available
for read and write operations.
A daily backup is done for these file systems with a single backup copy retained for each file.
4 February 2010 /ncsu/volume2
There are file system errors on /ncsu/volume2.
It is being unmounted to run file system check.
A number of disks in the array holding /ncsu/volume2 experienced problems concurrently. Working with the vendor the disks have been restarted and data appears to be intact. A reconstruction process is running.
Following completion of reconstruction, the file system will be made available read-only while a full backup is performed. Once the backup is completed, the file system will be available read/write.
4 January 2010 henry2 queues
Two new queues have been added to henry2 cluster
as a result of resources the College of Physical
and Mathematical Sciences added to the cluster.
Jobs submitted by accounts associated with projects
from PAMS that do not specify any other queue will be
automatically routed to the new PAMS queues.
Last modified: November 10 2014 18:22:20.