Operations News
27 September 2012 sam cluster
-
Migration to a new LSF version on the sam cluster will
take place between now and mid-October.
There may be occasional interruptions to scheduling
of new jobs. Running jobs should not be impacted.
24 June 2012 henry2 /share2
-
/share2 (and other file systems provided by the SONAS storage system)
will be unavailable Sunday 24 June for a hardware update.
On the HPC henry2 cluster the affected file systems are
/share2 /kelley_data /zhang_share /zhang_data /abdelkhalik_data /dgrp_share /mobyle_share /edwards_data /xie_data /gubbins_data /stormtrack_data /zhang_data7 /zhang_data8 /meskhidze_data /zhang_data9
22 June 2012 henry2 /gpfs_share
Update 30 June
Extensive efforts to recover the contents of /gpfs_share
have been unsuccessful. While it is possible to list
directory contents for most directories all attempts to
access file contents have failed. We deeply regret the
harm that loss of this file system's contents will cause.
We are proceeding with creation of a new /gpfs_share using new hardware. However, this event was not hardware related; the current /gpfs_share hardware is working normally. This event was due to corruption of the file system contents. We will continue to work with IBM to attempt to identify how this data corruption occurred.
/gpfs_share file system has been seriously damaged. Efforts are underway to determine if any data can be recovered. /gpfs_share is a scratch file system, so there is no backup.
Two login nodes experienced exceptionally heavy load and became unresponsive. These login nodes had to be forcibly shut down and restarted. Following restart two directories on /gpfs_share displayed signs of serious damage. A file system check was run on /gpfs_share - the check failed to complete, but did report indications of additional data corruption.
/gpfs_share is unavailable while we work to determine if any data is recoverable.
3 June 2012 henry2 /share2
-
/share2 (and other file systems provided by the SONAS storage system)
will be unavailable Sunday 3 June for a system update.
17 May 2012 R
-
Default version of R on henry2 cluster has been updated from version 2.11
to 2.14 (version accessed using the 'add R' command). Version 2.11
can still be accessed using 'add R-2.11'.
2-3 May 2012 henry2 GPFS file systems
-
To prepare for moving /gpfs_share to new hardware
the version of GPFS on file server nodes will be
updated to version 3.3. All GPFS file systems will
be unavailable Wednesday May 2 and Thursday May 3.
May 3 update - some GPFS file systems are online again. Files can be accessed, but jobs should not be started from these file systems yet. There will be another disruption when the /gpfs_share update completes, to finalize the update to version 3.3.
9 April 2012 login03
-
ssh key for host login03 was updated today.
The new DSA public key is:
ssh-dss AAAAB3NzaC1kc3MAAACBALMS1lnD+99/rhcvcBj3AJGuC3j5PgTtwZfNQpoGsaLNlGJ2usNN7OxFjik1gm1z4VUfrxJMlV7LCnu6upprpROhxk38qg+Mf1aWKOphFPk77c955apKSWZizBOo/NG//fDz/rrSEpjqsODqFiQ3wUjzlKpP9L38K3pijjjz/H2ZAAAAFQCksCY7vggU2IOeOKSSBHckp+SqoQAAAIB3vXpjuk8SEDxDP3VNLs1uxV83NznwNrAy6xILa2AEsnKhLktJCzCRmAuyRY8jrUQaXFXH1EZpQ1wz8SimfD+9CXtA7NQPYj99ectUqmt8icplxFqgTKtxXH2fevoPQbW/xnWyZMkmIlVQPNR7g3XrnxT3RtN93sfr94nBWeGcPgAAAIEAlKDLXU4wtxk4d/nzKBMaD7sJU07+Z25aysChz+987eG5SmdTpj69grpTBXtFMejYiHKFUXSDG+0xA+3rhLSija3xVcad6AtjBB3pH/WYQVPwXisLsKymEmXyxWSYDhJXoibSWlkst0xqzgDywjr1xL6kAuLG6e89Tc9/uNdonGw=
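When an ssh client warns that a host key has changed, it typically shows a fingerprint rather than the full key. The sketch below shows one way to derive the colon-separated MD5 fingerprint (the form displayed by OpenSSH clients of this period) from a published public key line. The function name md5_fingerprint is hypothetical, chosen only for illustration.

```python
import base64
import hashlib


def md5_fingerprint(public_key_line: str) -> str:
    """Return the colon-separated MD5 fingerprint of an OpenSSH public key line.

    The line is expected in the usual "<key type> <base64 data> [comment]" form.
    """
    fields = public_key_line.split()
    if len(fields) < 2:
        raise ValueError("expected '<key type> <base64 data>'")
    # The fingerprint is the MD5 digest of the decoded key blob, not of the text.
    blob = base64.b64decode(fields[1])
    digest = hashlib.md5(blob).hexdigest()
    # Group the hex digest into two-character pairs joined by colons.
    return ":".join(digest[i:i + 2] for i in range(0, len(digest), 2))
```

Passing the full ssh-dss line above to md5_fingerprint and comparing the result with the fingerprint shown by the ssh client is a reasonable check before accepting the new key.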
14 February 2012 Matlab
-
Default version of Matlab on henry2 cluster has been updated from 2009b
to 2011a (version accessed using the 'add matlab' command). 2009b
can still be accessed using 'add matlab2009b'.
31 October 2011 CFX
-
Default version of CFX (now also known as ANSYS CFD)
has been updated to v12.1 (version accessed using
add cfx command).
27 September 2011 henry2 /home and /usr/local
-
A drive failed in the disk array that holds /home and
/usr/local file systems.
While the array was rebuilding, a second drive failure occurred, resulting in loss of both file systems.
/home and /usr/local are being restored from backup. Due to the size of these file systems this may take a day or two to complete.
Because LSF resides on the /usr/local file system and requires /home to submit jobs, the cluster is essentially unusable until the file systems are restored.
Logins have been disabled to allow file system restoration to proceed as rapidly as possible.
morning 28 September 2011 Update
Restore of /usr/local is complete and work is underway to bring LSF back up and to restart various affected license servers.
evening 28 September 2011 Update
Restore of /home is about 70% complete. We anticipate that restore will finish sometime tomorrow.
evening 29 September 2011 Update
As of ~8pm /home restore is complete and ssh to login nodes is enabled. LSF queues remain inactive - jobs can be submitted but are not being started. Expect LSF to be ready to schedule new jobs by Friday morning.
30 September 2011 Update
All LSF queues were returned to active state by around noon on 30 September.
We sincerely regret the extensive inconvenience that resulted from this event.
2 September 2011 Matlab
-
Symbolic link /usr/local/apps/matlab was changed
to point to /usr/local/apps/matlab2009b. This is
consistent with the default version of matlab
accessed using the 'add matlab' command.
License file for Matlab was also updated.
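Scripts that refer to /usr/local/apps/matlab directly may want to confirm where the link currently resolves. A minimal sketch, assuming a hypothetical helper named link_points_to:

```python
import os


def link_points_to(link_path: str, expected_target: str) -> bool:
    """Return True if link_path is a symbolic link that resolves to expected_target."""
    # os.path.islink is False for regular files, directories and missing paths.
    if not os.path.islink(link_path):
        return False
    # Resolve both sides so relative links and nested links compare correctly.
    return os.path.realpath(link_path) == os.path.realpath(expected_target)
```

For the change described above, the call would be link_points_to("/usr/local/apps/matlab", "/usr/local/apps/matlab2009b").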
16 August 2011 SONAS Maintenance
-
In preparation for Fall Semester there will be maintenance
performed on the SONAS storage system from 6pm to 8pm
to enable an additional file sharing protocol. NFS access
to SONAS file systems may be unavailable during this time.
These file systems are:
/share2
/kelley_data
/abdelkhalik_data
/dgrp_share
/mobyle_share
/edwards_data
/xie_data
/gubbins_data
/stormtrack_data
/zhang_data
/zhang_data7
/zhang_data8
/meskhidze_data
16 August 2011 DL Poly
-
DL Poly has been updated to version 4.01.
This version replaces previous versions 2 and 3.
27 June 2011 /share2
-
/share2 will be unavailable from 8am - 8pm on Monday
June 27. Storage controllers will be replaced in the
storage system providing the /share2 file system.
Update: due to unanticipated complications during the controller
replacement, /share2 remained unavailable until about noon on
Tuesday June 28.
17 June 2011 LSF
-
LSF files and executables reside on /usr/local. Following
unavailability of /usr/local LSF was not
functioning correctly. Restart of LSF was completed about 3:30pm.
17 June 2011 /usr/local
-
File server for /usr/local on henry2 is not currently
reachable, so the /usr/local file system is not available
to the cluster.
Update: the file server had run out of memory. It has been
rebooted and the /usr/local file system is again available.
10 June 2011 /share3
-
File server for /share3 file system will be replaced
with newer, higher capability server. Expect /share3
to be unavailable from 08:00-13:00 Friday 10 June.
25 May 2011 /share
-
File server for /share file system was replaced
with newer, higher capability server.
29 March 2011 /share3
-
A file system error occurred on /share3. The file system
was automatically remounted read-only. File system check is
being run to find and correct file system problems.
7 February 2011 henry2 32-bit login nodes
-
All henry2 cluster nodes will be running 64-bit Linux
by 7 February 2011. The login nodes reachable from
login.hpc.ncsu.edu will be reloaded with 64-bit Linux
the morning of February 7. These login nodes will then
be running the same Linux as the login nodes reached from
login64.hpc.ncsu.edu.
28 January 2011 matlab
-
Default version of Matlab has been changed from 7 to 2009b
(version accessed using 'add matlab' command). Version 7
can still be accessed using 'add matlab7'.
10 January 2011 henry2 login node IP addresses
-
IP addresses for henry2 login nodes will change
starting early morning. There may be issues connecting
as DNS servers are updated with the new addresses.
9 December 2010 /share2
-
Storage system that houses /share2 file system experienced
problems. Vendor checked the system and /share2 is now
available again on all cluster nodes.
Please send a report to oit_hpc@help.ncsu.edu if you continue to see issues with /share2.
3 November 2010 Amber 10
-
Latest patches were applied to Amber 10 and it has been recompiled.
14 October 2010 Espresso
-
Default version of Espresso has been changed from version 3.0 to version 4.2.1 (version accessed when using 'add espresso' command). Version 3.0 can be accessed using 'add espresso-3.0' command.
24 September 2010 Abaqus
-
Default version of Abaqus has been changed from version 6.8 to version 6.9-EF (version accessed when using 'abaqus' command). Version 6.8 can be accessed using 'abq684' command.
22 September 2010 Amber
-
Default version of Amber has been changed from version 8 to version 10
(version accessed when using 'add amber' command). Version 8 can be
accessed using 'add amber8'.
22 September 2010 henry2
-
Network access to henry2 cluster was lost for ~30 minutes
starting about 9:40am. Access has been restored; the cause is under investigation.
23 June 2010 henry2
-
LSF host definitions were modified so that the bhosts
command now shows a condensed list of resources, summarized
by blade chassis. To see the listing for all nodes use the
command 'bhosts -X'.
23 June 2010 /share2
-
/share2 file system will be unavailable from approximately 9am
until approximately 5pm due to a hardware upgrade.
21 June 2010 Default PGI Compilers
-
Using the commands add pgi or add pgi64
will now set up your environment to use version 10.5 PGI
compilers and MPICH MPI libraries as provided by PGI.
Since the update to Red Hat Enterprise Linux 5, some MPI codes, particularly codes using 32 or more processors, have encountered P4 net_send errors. P4 is the underlying communications software used by MPICH. Neither P4 nor MPICH has been maintained for some time. MPICH development has moved to MPICH2 and uses other communications software.
MPICH2 libraries using hydra for process management are available and have been effective at resolving the net_send errors. Use the command add pgi64_hydra to configure your environment to use the PGI 64-bit compiler with MPICH2 libraries. Note that currently only the 64-bit version is supported (so it needs to be used from login64.hpc.ncsu.edu).
When using MPICH2 replace mpiexec with mpiexec_hydra to run MPI jobs. mpiexec_hydra will start the MPI job tasks under control of LSF scheduler.
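As an illustration of the mpiexec to mpiexec_hydra substitution, the sketch below renders a small LSF batch script as text. It assumes the environment was already configured with add pgi64_hydra before submission. The function name hydra_job_script, the bash shebang, the output file names and the resource values are assumptions for illustration, not site requirements.

```python
def hydra_job_script(executable: str, tasks: int, minutes: int) -> str:
    """Render an LSF batch script that launches an MPICH2 program with mpiexec_hydra."""
    if tasks < 1 or minutes < 1:
        raise ValueError("tasks and minutes must be positive")
    lines = [
        "#!/bin/bash",            # assumed shell; use whichever shell your jobs normally use
        f"#BSUB -n {tasks}",      # number of MPI tasks requested from LSF
        f"#BSUB -W {minutes}",    # wall-clock limit in minutes
        "#BSUB -o out.%J",        # standard output file; %J expands to the job ID
        "#BSUB -e err.%J",        # standard error file
        # No task count is passed here: per the note above, mpiexec_hydra
        # starts the MPI tasks under control of the LSF scheduler.
        f"mpiexec_hydra {executable}",
    ]
    return "\n".join(lines) + "\n"
```

Writing the returned text to a file and submitting it with bsub < file would be the usual next step; a job that previously ended with an mpiexec line differs only in that final command.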
25 May 2010 PGI Compilers
-
Portland Group Compilers version 10.5 are now available
on henry2 cluster.
Use command add pgi10-5 to configure your
environment to use the new version.
19 May 2010 henry2
-
Nearly all nodes have been reloaded and all queues are activated.
Please report any problems encountered to oit_hpc@help.ncsu.edu
17 May 2010 henry2
-
Reloading of henry2 compute nodes with new Linux version
continues. Several hundred nodes are now available and
queues are beginning to be reactivated.
Please report any problems encountered to oit_hpc@help.ncsu.edu
16 May 2010 henry2
-
Due to work at the Sullivan substation, power will be out to much of main
campus from 7am - 1pm. This power outage will affect the chiller plant that
provides a significant amount of cooling for the data center where the henry2
cluster is located.
To avoid overheating the cluster nodes, henry2 cluster will be shut down for the power outage. Access to some queues will begin to be restricted well ahead of the actual outage to allow jobs to complete prior to nodes being powered down.
We are planning to take this opportunity to update Linux version running on henry2 from RHEL4 to RHEL5. Sam cluster is already running RHEL5 and no incompatibilities have been observed. So, we expect the update to have minimal negative impacts. Primary impact will be that restarting the cluster following the power outage will be slower as each node will need to reload rather than just reboot.
15 May 2010 henry2
-
About 4pm a data center test in preparation for the planned power
outage on May 16 resulted in high temperatures around the henry2
cluster. A large number of henry2 nodes automatically shut down
due to the heat. Many running LSF jobs were lost.
We very much regret this event and the unexpected loss of work and are investigating procedures to prevent recurrence of this kind of event.
3 May 2010 /share2
-
A new shared, scratch file system, /share2, is now available
for use by all users. Like all /share file systems this space
is not backed up and may be subject to periodic purge.
16 February 2010 /ncsu/volume1 and /ncsu/volume2
-
Both /ncsu/volume1 and /ncsu/volume2 are again available
for read and write operations.
A daily backup is done for these file systems with a single backup copy retained for each file.
4 February 2010 /ncsu/volume2
-
There are file system errors on /ncsu/volume2.
It is being unmounted to run file system check.
A number of disks in the array holding /ncsu/volume2 experienced problems concurrently. Working with the vendor, we have restarted the disks and data appears to be intact. A reconstruction process is running.
Following completion of reconstruction the file system will be made available read-only while a full backup is performed. Once the backup is completed the file system will be available read/write.
4 January 2010 henry2 queues
-
Two new queues have been added to henry2 cluster
as a result of resources the College of Physical
and Mathematical Sciences (PAMS) added to the cluster.
Jobs submitted by accounts associated with projects
from PAMS that do not specify any other queue will be
automatically routed to the new PAMS queues.
Older Operational News
Last modified: September 27 2012 15:43:22.