15 September 2009 HPC backups
Over this past weekend (September 12-13) the backup
server for HPC backups began experiencing stability
problems. Currently the server is down.
This means that currently files being created or
modified are not being backed up and that we are
not able to restore files from backup at this time.
We are working to get the server back online, however,
HPC users should take extra care in working with files
and maintain their own backup of critical files.
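
As an illustration of keeping a personal copy of critical files, here is a minimal Python sketch. It is not an HPC-provided tool; the function name backup_files and the timestamped directory layout are assumptions chosen for this example.

```python
import shutil
import time
from pathlib import Path


def backup_files(paths, backup_root):
    """Copy each file into a new timestamped directory under backup_root.

    Hypothetical helper for illustration only. Files that share a base
    name would overwrite each other inside the same timestamped directory.
    """
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest_dir = Path(backup_root) / stamp
    dest_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for p in paths:
        src = Path(p)
        dest = dest_dir / src.name
        shutil.copy2(src, dest)  # copy2 also preserves modification times
        copied.append(dest)
    return copied
```

The backup directory should be on storage outside the HPC file systems (for example a local workstation) for the copy to be useful while the backup server is down.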
1 September 2009 /share
/share file system experienced errors and automatically
remounted itself read-only. Wednesday morning (2 September)
beginning at approximately 8am /share will be taken off
line to repair the file system. It is expected to be
available again by Thursday (3 September).
3 August 2009 henry2
About 7:30am the power connections for network switch for henry2
will be moved to provide improved redundancy. Power supplies in
the switch are redundant - so no service interruption is expected.
25 July 2009 /share4
During restore of some volume1 files, /share4 experienced
a file system error.
It has been unmounted from all nodes and currently a
file system check is running to repair the file system.
Storage array for /share4 had experienced a disk failure.
Disk has been replaced and array rebuilt and is again
available from all cluster nodes.
19 July 2009 henry2
henry2 cluster will be rebooted between 00:00 and 01:00 on
Sunday 19 July. Any LSF jobs running at this time will be
lost.
Also, due to electrical work in the data center about half
the henry2 nodes will not be available until potentially 18:00 (6pm)
on Sunday 19 July. Nodes impacted by the power outage will include
the regular login nodes.
Login access during the power outage will continue to be available
by reserving HPC login node using the Virtual Computing Lab
(http://vcl.ncsu.edu/).
9 July 2009 henry2
An LSF error resulted in many LSF jobs being lost from the system.
These jobs continue to run on the compute nodes, but are no longer
visible to LSF.
Many LSF queues were closed to allow the lost jobs to
complete without LSF scheduling new jobs on top of them on the
same compute nodes.
Working also to identify the cause of the LSF error which occurred
during a restart of the master batch daemon - which is a relatively
frequent event that until recently has not exhibited bad side
effects.
8 July 2009 /ncsu/volume1
File system has completely filled and is currently unusable.
Working to recover the file system.
22 June 2009 /ncsu/volume1
The mass storage file system /ncsu/volume1 will be migrated to new
server and storage hardware. To prepare for this migration the
file system is being remounted as read only. This will allow all
files on existing file system to be copied to the new file system
and verified.
Due to size of this file system it may take up to three weeks to complete
the migration. Once the migration is complete the file system will
be remounted read/write running from the new hardware.
15 May 2009 henry2
One of the computer room air conditioning (CRAC) units in the data center
where the henry2 cluster is located has partially failed. To reduce the
heat load in the data center a number of henry2 compute nodes have been
powered off. As jobs complete additional nodes will be powered off.
It is expected that the part needed to repair the CRAC unit will be
available Monday and the unit repaired Monday. Until the unit is repaired
queue wait times for henry2 will likely be longer as only a fraction of
the compute nodes will be available to run jobs.
16 April 2009 /ncsu/volume1
File system has passed the safe capacity threshold,
but is unable to keep up with rate that data is being
moved onto the file system.
File system has been remounted read-only to allow
sufficient free space to be recovered through file
migration.
12 February 2009 /ncsu/volume1
File system completely filled - again causing serious
problems for the file system. Working to repair file system.
5 February 2009 p5login
Access to p5login was lost.
Access was restored 6 Feb. The system has a failing
hardware component. System is no longer covered by
maintenance - so if the component fails the system
may or may not be repaired depending on cost.
6 January 2009 /home
Server for home file system is experiencing problems.
A new server is being prepared and is expected to be
installed Wednesday 7 January.
There are a number of issues related to the /home
file server problems. Many compute nodes have lost the
mount of /home resulting in these nodes being unreachable
via ssh (and therefore not able to successfully run jobs).
LSF queues have been changed to inactive status to prevent
jobs from starting and immediately failing from not being
able to reach some compute nodes.
Update: new file server for /home has been installed
and cluster has returned to normal operation as of about
11:15am Wednesday 7 January.
25 December 2008 /home
File server for /home file system was unreachable. File
server has been restarted and /home is again available.
15 December 2008 p5compute02
Node p5compute02 unavailable. Remaining 3 p575
nodes continue in service.
Currently, identifying the p5compute02 issue is delayed
while issues with an Ethernet switch serving
the POWER5 service network are resolved.
8 December 2008 /ncsu/volume1
Mass storage file system /ncsu/volume1 again
reached 100% full and is again unavailable while
the file system recovers some available space.
File system is available again as of Wednesday, 10 December.
1 December 2008 /ncsu/volume2
14 November 2008 /ncsu/volume1
File system is again experiencing problems from
being too full.
3 November 2008 /ncsu/volume1
27 October 2008 /share3
There are file system problems with /share3.
This file system will be taken off line to
run file system check. Expected that checking
the file system will take approximately one
day.
Large parallel jobs should use /gpfs_share
instead of /share or /share3 (which are NFS
mounted file systems and are not suitable
for use from large parallel jobs where all
processes read or write to disk).
29 September 2008 /ncsu/volume2
/ncsu/volume2 has been remounted as a read-only
file system in preparation for move to new disk hardware.
Following the backup on 30 September, the contents
of volume2 will be restored from tape onto a new
disk array. Once that is complete - it is estimated
it will take about one week - the new disk and file
server will replace the current /ncsu/volume2 and it
will again be available for read and write operations.
14 September 2008 /gpfs_share
File system /gpfs_share is currently not responding. The
problem is being investigated.
Failed jobs on compute nodes still having open files
on /gpfs_share appear to have stalled the file system.
Compute nodes were rebooted and /gpfs_share appears
to be working normally again.
8 September 2008 p5login
p5login.hpc.ncsu.edu will be briefly unavailable
Monday morning to change its public IP address.
28 August 2008 /ncsu/volume1
File system /ncsu/volume1 filled to 100% capacity.
File server became unresponsive.
The file server has been rebooted, however, the
file system remains unavailable from login nodes
while files are migrated to tape to restore free
space in the file system.
We expect the file system to be available from
login nodes by 5pm Friday 29 August.
29 June 2008 henry2
HPC henry2 cluster was unavailable as a result of
the data center power outage.
The power outage left the cluster in a very confused
state with nearly all file systems in degraded states.
Access was restored to login nodes about 6:30pm and
to LSF queues about 7:15pm. A set of compute nodes,
including nearly all the 32-bit nodes remain unavailable.
We will resume work on restoring these to service tomorrow.
We very much regret this interruption of HPC services.
13 May 2008 /ncsu/volume1
Mass storage file system /ncsu/volume1 is not currently
available. The file system filled and is currently being
repaired. All files appear to be intact, but the
hierarchical storage management software needs some time
to clean up the file system without any additional
writes. Expect the file system to be available again
late on Wednesday 14 May.
5 May 2008 Blade Center logins will be disabled.
Maintenance was performed on the BladeCenter cluster (henry2)
beginning at about 8am. System was available again
about 6pm. A number of file systems and services were migrated to
new servers during this maintenance and new login nodes were
installed.
Following the maintenance login.hpc.ncsu.edu connects to a set
of login nodes running 32-bit Linux (as before, but with new quad-core
nodes) and login64.hpc.ncsu.edu connects to a set of login nodes
running 64-bit Linux (instead of a single node - also now
quad-core nodes).
30 April 2008 VCL HPC nodes
VCL HPC nodes were down while a switch was reconfigured. They are
back up (Thursday, 1 May)
28 January 2008 GPFS
GPFS file systems will be unavailable from 8-10am to allow for
configuration changes associated with the increase in the
size of the /gpfs_share.
26 November 2007 /gpfs_share
In conjunction with the move of /home - since that will
effectively make the cluster unable to process jobs - /gpfs_share
will also be migrated to arrays with new disks. Following
this migration the capacity of /gpfs_share will increase
from its current 4TB to approximately 16TB. Per group quota
will remain unchanged for now.
25 November 2007 /home
/home file system will be moved to a disk array with new disk
drives. This move is in response to the drive failures that
occurred in mid-September.
During the move /home will be unavailable on both the Linux
cluster and the POWER5 system.
In preparation for the move LSF queues will stop scheduling
new jobs late Saturday November 24. Sunday November 25 the
current /home will be unmounted and a final backup done.
The new /home will then be mounted and the contents restored
from backup.
It is expected that /home will be available again by noon
on Monday November 26.
1 November thru 5 November 2007 /ncsu/volume1
The tape library which serves as the 2nd tier storage
for /ncsu/volume1 will be upgraded between 2-6pm on
Thursday November 1. During this time, files which
have been migrated to tape will not be available.
The expansion was not successful. The library is operational
and is able to complete daily backups. However it is not
able to access new tapes.
/ncsu/volume1 remains off line. The file system is over full
and the HSM software is working to free disk space and perform
maintenance on the file system.
Following firmware upgrade Friday Nov 2 hardware issues were
identified. New gripper and scanner have been ordered and
should be installed on Monday Nov 5.
/ncsu/volume1 is again available from HPC login nodes.
30 October 2007 login.hpc.ncsu.edu
Resolution of the name login.hpc.ncsu.edu changed from a load balancer to round robin DNS. Access will continue to be distributed between two 32-bit login nodes for the BladeCenter Linux Cluster (login01 and login02). Only the mechanism for this distribution was altered.
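
To illustrate the idea, the following Python sketch models round robin distribution as a simple rotation over the two login nodes. It is a simplified model only: real round robin DNS rotates the order of the address records it returns, and client-side caching can change which node a given user reaches. The function name assign is an assumption made for this example.

```python
from itertools import cycle

# The two 32-bit login nodes named in the announcement.
LOGIN_NODES = ["login01", "login02"]


def assign(n_connections, nodes=LOGIN_NODES):
    """Return the node each successive connection would reach
    under an idealized round robin rotation (no caching)."""
    rotation = cycle(nodes)
    return [next(rotation) for _ in range(n_connections)]


# Four successive connections alternate between the two nodes.
print(assign(4))
```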
28 September 2007 /ncsu/volume1
/ncsu/volume1 file system completely filled overnight.
Due to the nature of this file system (hierarchically
managed) some amount of free disk space is essential
for it to operate. Currently the file system is
unusable and has been unmounted.
Some files will have to be removed from the file system
before it will be usable again. If you have any large
files on /ncsu/volume1 that you also have a copy of on
your own system, and that we could remove from /ncsu/volume1,
please let us know.
13 September 2007 /home
/home file system is again off line. IO errors
caused the file server to remount the file system
read only. To correct the errors the file system
had to be taken off line. File system check is
being done to try to correct any file system errors.
11 September 2007 /home
Storage array holding the /home file system experienced
a disk failure - the disk was replaced,
however, before the RAID data recovery was complete a
second disk failed. The second disk failure resulted
in loss of data. The file system contents are being
restored from tape backup to a different disk array.
Update - 13 September
The restore from tape took longer than expected due
to a very large number of small files in /home.
/home was back online at about 9:30pm 12 September.
Between 10 September and 12 September we have experienced
four disk failures in the storage arrays which contain
most of the HPC file systems (/home, /share, /share3,
/share4, and /gpfs_share). We will be ordering new
disk drives (current drives are four years old and out
of warranty) and migrating these file systems to the
new disks over the next few weeks.
6 September 2007 ANSYS and CFX
Licenses for ANSYS and CFX (now owned by ANSYS) have been
renewed. ANSYS has imposed a new license term which must
be individually accepted before access can be granted to
ANSYS or CFX. Visit the software link from the HPC home
page (http://hpc.ncsu.edu/) and then select the button
by ANSYS and CFX to request access.
ANSYS license also allows only a single version of ANSYS
to be in use on campus. That version is now version 11.
Default CFX version is also being changed to version 11
to be consistent with the ANSYS version.
Also only 64-bit versions of ANSYS and CFX are currently
available. These versions will not run on the default
login nodes. Use login03 to access the GUIs for ANSYS or
CFX and add "-R em64t" option on bsub commands for jobs
using ANSYS or CFX to ensure the job is scheduled on a
64-bit compute node.
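
As a sketch of how the "-R em64t" option fits into a submission, the Python helper below assembles a bsub argument list. Only "bsub" and "-R em64t" come from the announcement above; the helper name bsub_command and the job script name are hypothetical, and any other options a real job needs (queue, processor count, output files) are omitted.

```python
import shlex


def bsub_command(job_command, resource="em64t"):
    """Build the argument list for a bsub submission that requests
    the given resource with -R (em64t selects a 64-bit compute node)."""
    return ["bsub", "-R", resource] + shlex.split(job_command)


# "./run_cfx.sh input.def" is a placeholder job command.
args = bsub_command("./run_cfx.sh input.def")
print(" ".join(args))
```

The resulting list could be passed to subprocess.run on a login node where bsub is available.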
2 September 2007 /home
File server for /home was down. This resulted in login
attempts on henry2 cluster hanging and logins on the POWER5
system receiving an error about missing /home file system.
File server for /home was restarted.
27 August 2007 POWER5 System
The POWER5 system will be unavailable due to required hardware
maintenance beginning at 8am. The system was expected to be
available again by 2pm, however, the hardware maintenance took
longer than originally expected. System was available for users
again as of 5pm.
11 August 2007 henry2
henry2 login nodes were again becoming unresponsive.
Server for /usr/local/apps was restarted.
9 August 2007 henry2
Linux cluster (henry2) 32-bit login nodes (login.hpc.ncsu.edu)
were unreachable. This appears to have been the result of many
large memory jobs being run at the same time on the login nodes.
Please do not use login nodes for running jobs. Any
resource intensive tasks should be submitted to LSF.
31 July 2007 POWER5
14 July 2007 CFX
The license for CFX expired July 13. We are working on
renewing the license. However, no estimate is currently
available for when (or even if) the renewal process
will be completed.
17 May 2007 /ncsu/volume1
/ncsu/volume1 is reporting disk errors and has been
unmounted from all login nodes while the problem is
evaluated.
24 April 2007 /gpfs_share
/gpfs_share file system crashed at about 9:30am. The
gpfs_share file servers have been brought back online
and the file system is available again on the servers.
gpfs is being restarted on all henry2 login and
compute nodes.
10 April 2007 Linux clusters
Overnight the network connections for message passing
traffic between nodes were lost - causing parallel
jobs running across multiple chassis to end and new
jobs attempting to start across multiple chassis to
fail.
9 April 2007 /home file server
OS update for the /home file server was not able to
be completed in parallel with the network switch work.
Cluster remained unavailable until about 10am while
the /home file server was updated to the same Linux
version as the cluster login and compute nodes.
We regret this extended down time - but feel it was important to
get the file server OS updated.
As a side effect of the OS update, quota information
for /home was lost. Quotas will be reset - at a value
higher than current use - but not necessarily at the
same level as previously.
9 April 2007 Linux clusters down
From 6-8AM on Monday April 9th the cluster will be down
to update the core ethernet switch.
It is likely that any jobs running at that time will be
lost since network connections to storage and between
chassis will be disconnected.
To minimize lost work, queues will be paused Sunday
evening (April 8th) to allow as many jobs as possible to
complete prior to the network work Monday morning.
21 February 2007 /share
/share file system is again available read/write from
all henry2 nodes. The file server for /share has been
replaced. Also the mount options for this file system
are now identical to /share3. MPI-IO jobs should use
/gpfs_share. MPI-IO will no longer work reliably on
/share - however, performance of /share should now
be as good or better than /share3.
21 February 2007 henry2 login nodes
henry2 login nodes (login01 and login02) will be
replaced with newer servers (current servers are
no longer under maintenance).
The transition will happen in two steps. The
first step occurred on Monday 19 February.
login01 was replaced by a newer server running
a more recent version of Linux. Please report
any issues observed using the new login01.
Once all issues are resolved with the new login01 node -
login02 will also be replaced.
2 February 2007 /home file system
Responsiveness of /home file system deteriorated over
night until this morning logins became completely
impossible. Server for /home was rebooted along with
both login nodes (which appeared to be the source
of the /home problem).
29 January 2007 HPC core network switch
The core HPC network switch will be rebooted at 8am on
Monday January 29. The reboot will take about 5 minutes.
During the reboot connections to HPC cluster login nodes,
from HPC nodes to HPC storage, and for interchassis MPI
communications will be unavailable.
HPC jobs attempting to access HPC storage or communicate
between chassis during this time will fail.
To reduce the impact of the reboot, cluster queues will
not start new jobs after about noon on Sunday 28 January.
We regret this inconvenience, but the reboot is necessary
to apply a security update on the switch. Monday morning
was chosen as the time for the upgrade because that is
typically the time there are the fewest number of jobs
running on the cluster.
23-December-2006 /gpfs_share
Following the data center power failure, one
of four disk arrays used by /gpfs_share failed
to recover.
/gpfs_share was fully operational again about
2pm Dec 24.
20-November-2006 /ncsu/volume1
There are more than 250,000 migrated files in
/ncsu/volume1 file system. After applying a fix
provided by the file system vendor all but about
14 files have been repaired - all of these belong
to a single user.
/ncsu/volume1 is again available read/write
from HPC login nodes.
We very much regret this extensive period of
being unable to write to this important file
system. We are working with the vendor to
develop procedures to minimize the chance of
future disruptions.
4-October-2006 /ncsu/volume1
Migrated files on /ncsu/volume1 were not being
recalled on demand. File system is available
read only from login03.hpc.ncsu.edu while we
work with the vendor to correct the problem with
migrated files.
9-September-2006 /ncsu/volume1
A file system error occurred on /ncsu/volume1. The
file system was taken offline and repaired. File
system was off line from about 2pm Friday until
about 7:30am Saturday.
24-July-2006 /share and /share3 on henry2
File server for /share crashed about 3am and was
returned to service about 8am.
Several compute nodes remained in a busy state trying
to access /share3. About 9:30am the server for /share3
was rebooted to free the hanging compute nodes.
24-June-2006 henry2 Linux Cluster
Following a power outage Saturday afternoon
cooling for the HPC henry2 Linux Cluster was
lost. Cluster compute nodes were powered
down to minimize damage from the resulting
high temperatures. LSF jobs running on henry2
at 4pm Saturday were lost.
Please carefully check results from Saturday
jobs to be sure they completed correctly.
Also as a result of the loss of cooling the
server for /share3 failed. /share3 was returned
to service about 3pm Sunday 25 June.
11-May-2006 /ncsu/volume1
Server for /ncsu/volume1 has again become unstable.
Update: 17-May Some file system and NFS
settings have been adjusted. File system is
currently mounted from login01 only.
25-Apr-2006 /ncsu/volume1
/ncsu/volume1 is unavailable. Server
for this file system crashed.
Update: 5-May Server hardware has been
replaced and software reloaded. File system
is currently reconciling with the HSM database.
File system is available again as of about
3pm 5 May.
1-Mar-2006 /ncsu/volume1
/ncsu/volume1 is now available from new
server and disk space for reading and
writing. This file system is now managed
by a hierarchical storage manager that
will migrate old, large files to tape.
Any access of migrated files will
restore them to disk - with some delay
as the tape is loaded and read.
15-Feb-2006 Storage News
Two storage enhancements are underway on
HPC systems.
Henry2 GPFS - A GPFS (general parallel
file system) instance is being deployed on
the henry2 Linux cluster. This is the same
type of file system that is used for shared
scratch space on the POWER5 system. The
cluster implementation currently uses two
servers and has about 2TB capacity.
Testing so far has shown about 3X better performance
than the best performance seen with NFS
shared file systems (eg /share3).
If testing continues to go well, disk
resources currently allocated to /test_share
will be redeployed to gpfs along with another
TB of disk to provide 6TB of gpfs space
this spring. Eventually we expect that /share
and /share3 disks will also be reallocated
to gpfs to provide 8TB of gpfs space. Target
for this transition is during Spring 2006 exams.
Group quotas will be enforced on the GPFS
file system. Currently the group quota is
1TB. Also, like other shared scratch file
systems the gpfs space will not be backed
up and will be subject to a periodic purge
to maintain free space.
Mass Storage - Mass storage volume1 is
in the process of being migrated to a new
server. After migration this file system
will be managed with Tivoli Space Manager.
This will allow large files which have not
been recently accessed to be stored on tape
rather than disk, thereby making additional
storage space available in the /ncsu/volume1
file system. Actual additional space will
depend on compression ratio achieved in
storing files to tape, but it is estimated
that the current tape library capacity
will provide an additional 10TB of mass
storage space.
This will increase mass storage space to
approximately 26TB from the current 16TB
and is expected to be in operation by the
end of February.
6-Jan-2006 Major Maintenance Window
On Friday January 6 a major maintenance window
will be taken to make significant adjustments
to HPC systems:
- Two additional p575 nodes will be added to
the power5 system
- Network switches for power5 system will be
upgraded
- Network switches for mass storage system
will be relocated
Due to these changes the Power5 system will not
be available between 6am Friday and 6pm Friday.
Also, the mass storage directories (/ncsu/volume1
and /ncsu/volume2) will not be available from the
HPC Linux clusters (henry2 and tim) from 6am Friday
until 6pm Friday.
28-Dec-2005 /ncsu/volume[12] read-only
/ncsu/volume1 and /ncsu/volume2 will be read-only
from Wed Dec 28 through approximately Sat Dec 31
to allow the migration of /ncsu/volume1 to a
slightly larger file system.
5-Dec-2005 henry2 network outage
There will be a brief network outage for the
henry2 cluster Monday Dec 5 about 7:30am.
This outage is to allow the switch serving
the henry2 cluster to be upgraded. Outage
is expected to last about 10 minutes.
13-Nov-2005 /share file system on henry2
The /share file system will be unavailable for
10-15 minutes between 8pm and 8:30pm on Sunday
evening. This down time is needed to allow for
maintenance on the disk array serving /share.
LSF jobs running from /share could abort when
/share is taken off line. Users planning to run
jobs over the weekend may want to run from
/share3 instead of /share.
7-Nov-2005 Power5 system
Power5 system network connections to internal
HPC network are down. This is resulting in
/home, /usr/local/apps, /ncsu/volume1, and
/ncsu/volume2 being unavailable. Comtech has
been notified of the problem.
6-Nov-2005 henry2 cluster
File server for /share file system had hung and had
to be rebooted. LSF jobs running from /share were
lost.
3-Nov-2005 henry2 cluster and power5 system
File server for /home file system had to be rebooted
to clear lots of hanging processes on login nodes.
22-Oct-2005 henry2 cluster
File system on management node holding LSF filled
overnight. This caused LSF to stop.
LSF was moved to a new, larger file system.
Jobs submitted after the old file system filled
would have been lost.
Please send email if any problems or unusual
behavior are observed with LSF on the cluster.
06-Oct-2005 /ncsu/volume[12] mass storage
Mass storage file systems are again available from
HPC login nodes.
During the next couple months the mass storage
file systems will be migrating from a single server
to a server for each file system. During this
transition there will be some periods of time
that the file systems will be read-only. These
read-only periods will be announced in News,
Sysnews, and login banners.
Once multiple servers are in place the mass
storage system will be much less likely to
be offline from a single component failure.
05-Oct-2005 /ncsu/volume[12] mass storage
Server for /ncsu/volume[12] is not responding.
Server has had a hardware failure, it is
being repaired.
03-Oct-2005 LSF licenses
There are ongoing issues with LSF licenses
on the power5 system.
At this time the power5 system is being
returned to friendly user mode due to
issues with batch processing.
We are working with the LSF vendor to
resolve these issues as quickly as possible.
We very much regret the inconvenience this
license problem is causing power5 users.
23-Sept-2005 LSF licenses
Renewal LSF licenses were installed yesterday.
This morning there were problems with LSF having
the correct licenses for scheduling parallel jobs.
The version of LSF running on the henry2 Linux
cluster was updated from 5.1 to 6.1. Users should
log off and back on the cluster before submitting
jobs to ensure that their environment is correctly
configured for the new LSF version.
19-Sept-2005 power5 system
As of Monday September 19 the power5 system is
in production operation.
The system supports large memory (up to 32GB
of physical memory) jobs using up to 8 processors.
Fortran, C, and C++ compilers are available to
build user applications. MPI or OpenMP parallelization
are supported on the system.
For more information regarding the power5 system
see the power5 "How to" page:
http://hpc.ncsu.edu/Documents/SharedMemory/GettingStartedp5.php
12-Sept-2005 login.hpc.ncsu.edu
The load balancer for login.hpc.ncsu.edu will be changed
at midnight Monday 12 September. Open sessions will be
dropped. However, new ssh sessions should be immediately
available through the new load balancer.
31-Aug-2005 IBM p575
1-Aug-2005 IBM p575
- The new shared memory system has two IBM p575 compute nodes each with 8 1.9GHz single-core POWER5 processors
and 32GB of memory; an IBM p550 login node with two 1.65GHz dual-core POWER5 processors and 8 GB memory;
and 2TB shared scratch space available to all three nodes using IBM's general parallel file system (gpfs).
All shared file systems are also available.
Please note that the new system and the
Henry2 cluster share the /home directory (with p5 home being in /home//p5, and
Henry2 continuing to be in the /home/ directory).
Also, please note that your userid on p5 is now accessible using your unity password.
7-July-2005 henry2 software updates
- Totalview debugger - has been updated to version 7.0.0-1
License permits debugging of parallel jobs using up to 4 processors
- CFX5 - has been updated to release 5.7.1
Also the renewed license includes 16 parallel processing
licenses - please limit use to no more than 8 tasks for
a single job.
1-July-2005 cluster /home file system
Since the operating system update on the
henry2 cluster there have been a number of
software failures on the server for the
/home file system. Efforts to identify the
cause of these failures have not been
successful. On June 30 the server experienced
three failures. Following the second June 30
failure a new server was configured and
migration of /home to the new server began.
Following the third June 30 failure the migration
to the new server was completed.
Jobs using files on /home may have encountered
problems June 30 due to the number of failures
and extended period of the third outage as
the transfer to the new server was completed.
The new server has twice as much physical memory
as the previous server and is running the same
Linux distribution and kernel as the cluster nodes
(whereas before the server was running a different
kernel).
We will continue to closely monitor the server
for /home and regret the inconvenience the
previous failures have caused.
1-July-2005 IBM p690
Production use of IBM p690 ended June 30, 2005.
A replacement system based on POWER5 processors
has been delivered and is expected to be installed
within the next few days. During the transition
to the new system the p690 will remain available,
however, it is no longer under maintenance so
any hardware failures may not be repaired.
Timeline for friendly user access to the new
system will be posted once installation is
complete.
Output from any jobs run on the p690 during
this transition period should be copied off
as soon as possible - keeping in mind that
the system is no longer under maintenance.
6-June-2005 IBM p690 Replacement
The IBM p690 which NC State has operated for the
past two years will be replaced with a new
shared memory computing system. The new system
will be installed in mid-June and it is planned
to retire the p690 at the end of July.
NC State has been paying the annual hardware maintenance
costs for the p690. The replacement system has been
acquired for approximately the amount that would
have been spent renewing the p690 maintenance for
another year.
Existing p690 hardware maintenance expires at
the end of June. While it is planned to continue
operating the p690 through July, any hardware
failure during this time would likely not be
repaired. Users should be careful to get data
off the p690 promptly when runs complete.
The new shared memory system will have two
IBM p575 compute nodes each with 8 1.9GHz
single-core POWER5 processors and 32GB of
memory; an IBM p550 login node with two 1.65GHz
dual-core POWER5 processors and 8 GB memory; and
2TB shared scratch space available to all three
nodes using IBM's general parallel file system (gpfs).
2-June-2005 henry2 /home file system
24-May-2005 henry2 /home file system
12-May-2005 Intel Compilers - Henry2 Linux Cluster
source /usr/local/intel/compiler70/ia32/bin/ifcvars.csh
to access 7.1 instead of 8.1
8-May-2005 Henry2 Linux Cluster
Henry2 linux cluster login nodes were not responding.
File server for home file system was down. The
server was upgraded and returned to service. Login
nodes available as of 10:30am
26-Mar-2005 IBM p690 (mcrae)
IBM p690 (mcrae) went down about 12:30 Saturday
afternoon. System was rebooted and returned to
service around 4:30pm. About 9:30pm the system
crashed again.
Access to the p690 was restored about 4pm on
Monday (28 March). LSF jobs that were running
at the times the system crashed were lost. Jobs
waiting in LSF queue were not affected.
8-Mar-2005 /ncsu/volume2
7-Mar-2005 /ncsu/volume2
Half of the university storage management system (SMS) is
being relocated. While this part of the SMS is offline,
/ncsu/volume2 will be available read only from
the backup version.
Users may find some files in the backup that they
previously deleted.
It is expected that the SMS will be back online by
Wednesday March 9.
3-Mar-2005 p690 (mcrae) LSF ok
A combination of network link updates and license changes
to/at MCNC has caused instability in both LSF and mpiexec.
All issues appear to have been resolved.
24-Feb-2005 p690 (mcrae) LSF Down
IBM p690 (mcrae) has lost connection to LSF license server.
LSF is down. Running and queued jobs should not be affected.
New jobs are not being accepted, and no new jobs are being started.
Working to identify cause for loss of connection. Basic LSF
operation was restored before 5pm.
18-Feb-2005 /ncsu/volume2 Unavailable
From about 9pm Friday 18-Feb /ncsu/volume2 will be unavailable
due to maintenance on the university storage management
system. /ncsu/volume2 will be back online by 8am Saturday 19-Feb.
16-Feb-2005 /ncsu/volume2 and /ncsu/volume1 Unavailable from Clusters
/ncsu/volume1 and /ncsu/volume2 have been intermittently unavailable
from the cluster today. Access from the p690 has not been affected.
Problem was resolved in network by late afternoon.
12-Feb-2005 /ncsu/volume2 Unavailable
From about 9pm Friday 11-Feb /ncsu/volume2 was unavailable
due to maintenance on the university storage management
system. /ncsu/volume2 was back online before 8am Saturday 12-Feb.
2-Feb-2005 /share Unavailable
Thursday Feb 3, /share file system will be unavailable
briefly about 8am. The server for /share file system
will be rebooted to bring online additional storage.
19-Jan-2005 Clusters find a new home
HPC Xeon cluster (henry2) and the Opteron cluster (Tim)
have moved to the new Computer Disaster Relief machine
room... Cooool space!!!
16-Jan-2005 Cluster Move Update
HPC Xeon cluster (henry2) was returned to service
around 6pm on Sunday Jan 16. LSF queues were restarted
a few hours earlier.
Only one login node is currently available for
henry2 cluster. Second login node should be available
again Tuesday.
Opteron test cluster is expected to be available for
use again by end of day Tuesday Jan 18.
13-Jan-2005 /ncsu/volume[12] File Systems
Users will have access to /ncsu/volume[12] during the cluster move
starting Friday January 14 (but will need a mcrae account).
28-Dec-2004 University Linux Clusters Moving
The university Linux Clusters (henry2 and tim) will
be moved to the new data center. Currently it is
expected the move will begin Friday January 14 and
be completed by Tuesday January 18.
In preparation for the move LSF queues will stop
starting new jobs around noon of Thursday January 13
(except for debug queue). Queued jobs that have not
started should requeue after the move without problems.
Jobs running when the cluster goes down for the
move will likely be lost.
Data stored on /ncsu/volume[12] will continue to
be available from mcrae.hpc.ncsu.edu.
15-Dec-2004 Opteron Test Cluster
A small (4 compute nodes + 1 interactive node) AMD Opteron
cluster (tim) is now available for testing by friendly users.
The cluster uses IBM e325 dual Opteron servers with 2GHz
processors. The compute nodes each have 9 GB of memory.
Opterons will run x86 binaries and the Portland Group x86-64
compilers are available to develop 64-bit executables.
Contact eric_sills@ncsu.edu if interested in
being a friendly user of the Opteron cluster.
10-Dec-2004 henry2 reboot
The cluster head node will be rebooted Friday morning to
attempt to clear some issues being observed with file systems.
During the reboot access to /home and /usr/local file systems
will be lost.
25-Oct-2004 henry2 login nodes
Login sessions via ssh to the henry2 cluster should
be to login.hpc.ncsu.edu
This will direct the login session to one of the two
current login nodes. This avoids a potential single point
of failure for the cluster and also permits easy
expansion of login nodes if needed to support future
use.
Access to henry2.hpc.ncsu.edu has been restricted.
19-Oct-2004 Mass Storage Offline
/ncsu/volume1 and /ncsu/volume2 became unreachable from
henry2 around 3:30pm. By 4:30pm these file systems were
also unreachable from mcrae. File server was rebooted and
connectivity restored around 5:30pm.
18-Oct-2004 Perros appointed to NLR Network Research Council
Dr. Harry Perros (NC State, Computer Science) has been appointed to the
National Lambda Rail (www.nlr.org) Network Research Council (NLR NRC).
A significant portion of the NLR facilities is to be devoted to research in networking. The NLR NRC will
provide guidance to the Board of NLR and inform the networking community of this opportunity. It will also provide
input on the critical research issues that can utilize the advanced capabilities of the NLR network.
Members are
Paul Barford, University of Wisconsin-Madison
Dan Blumenthal, University of California, Santa Barbara
Javad Boroumand, Cisco Systems
Hank Dardy, Naval Research Laboratory
Constantinos Dovrolis, Georgia Tech
David Farber, Carnegie Mellon University (chair)
Gerald Faulhaber, University of Pennsylvania
Paul Francis, Cornell University
Larry Landweber, University of Wisconsin-Madison and Internet2 (ex officio)
Jason Leigh, University of Illinois-Chicago
Steven Low, Caltech
Mike O'Dell, unaffiliated
Phil Papadopoulos, University of California, San Diego
Craig Partridge, BBN Technologies
Guru Parulkar, National Science Foundation
Harry Perros, North Carolina State University
14-Oct-2004 Thom Dunning to lead NCSA
12-Oct-2004 ORNL Positions
11-Oct-2004 - mcrae (IBM p690) response slow
On Thursday 30 Sept response from IBM p690 (mcrae) became
very slow and system was rebooted. Unfortunately, LSF jobs were
lost during the reboot.
Following the reboot, LSF MPI jobs failed with either
license or PJL errors, until configuration changes were
made on Monday 3 October.
p690 again began to display very slow response on
Saturday 9 October. System was rebooted Sunday 10
October. All running LSF jobs had completed prior
to the reboot.
29-Sep-2004 - henry2 again available
- ssh connectivity to henry2 was lost from approximately
8pm Friday 24 Sept until 10:30am Saturday 25 Sept.
- ssh connectivity was again lost from approximately
12:30-1:30pm on Monday 27 Sept.
- ssh connectivity was again lost from approximately
11pm-midnight on Wednesday 29 Sept.
LSF jobs continued to run on compute nodes. Jobs
using /home during the above times should be examined
closely for any problems.
23-Sep-2004 - Parallel Jobs on Henry2 Cluster
LSF job scripts that explicitly invoke 'pam'
should revert to using 'mpiexec' command to
execute MPICH jobs.
Due to upcoming LSF license expiration and renewal,
there may be periods during which pam will fail
with an error message saying 'Node not licensed'.
The 'mpiexec' command will be altered as needed
during the license transition to utilize the
best MPICH execution mechanism available.
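
Before editing job scripts, it may help to find which ones still mention
'pam'. The following is a minimal sketch, not an official HPC tool: the
function name find_pam_scripts and the directory passed to it are
hypothetical, and it only searches for the word 'pam' (including in
comments), so each reported file still needs to be checked by hand.

```shell
#!/bin/sh
# Hypothetical helper: list files under a directory that contain the
# word 'pam'. Assumes job scripts are plain-text files.
find_pam_scripts() {
    dir="${1:-.}"    # directory to search; defaults to the current one
    # -r: search recursively
    # -l: print only the names of matching files
    # -w: match 'pam' as a whole word, so 'spam' or 'params' do not count
    grep -rlw -- 'pam' "$dir" | sort
}
```

For example, find_pam_scripts ~/jobs (a hypothetical directory) prints one
path per line. Each listed script can then be edited to call 'mpiexec' in
place of 'pam'.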
6-Sep-2004 - IBM Announces "Open Architecture" for Blades
6-Sep-2004 - NC State Virtual Laboratory
- Virtual Computing Laboratory is here
6-Sep-2004 - Conferences
- ITD Booth at EdTech - come and visit us
- UNC CAUSE - HPC will present
- Supercomputing 2004
6-Sep-2004 - Processors added to cluster
The Henry2 cluster now has a total of 208 processors.
6-Sep-2004 - HPC and Grid Courses
HPC Group is involved in teaching two graduate and one
undergraduate course related to HPC and Grid computing.
11-August-2004 - henry2
Old News (7/1/03-7/31/04)