Older Operations News
- Two additional p575 nodes will be added to the power5 system
- Network switches for power5 system will be upgraded
- Network switches for mass storage system will be relocated
- The new shared memory system has two IBM p575 compute nodes each with 8 1.9GHz single-core POWER5 processors
and 32GB of memory; an IBM p550 login node with two 1.65GHz dual-core POWER5 processors and 8 GB memory;
and 2TB shared scratch space available to all three nodes using IBM's general parallel file system (gpfs).
All shared file systems are also available.
Please note that the new system and the Henry2 cluster share the /home directory (with p5 home being in /home/
/p5, and Henry2 continuing to be in the /home/ directory). Also, please note that your userid on p5 is now accessible using your unity password.
- Totalview debugger - has been updated to version 7.0.0-1
License permits debugging of parallel jobs using up to 4 processors
- CFX5 - has been updated to release 5.7.1
Also the renewed license includes 16 parallel processing licenses - please limit use to no more than 8 tasks for a single job.
31 December 2009 henry2 /share3
-
/share3 encountered a file system error and was automatically
remounted as read-only at approximately 5am on Dec 31. To
repair the file system it will need to be taken off line and
checked. With a 4TB file system the file system check can take
several hours. We plan to run the file system check on
January 1, 2010. /share3 will be unavailable much of the day
on January 1st.
17 December 2009 henry2 /share and /share3
-
The automated purge of /share and /share3 has been restarted.
Oldest files on these file systems will be automatically
deleted to maintain available space. This became necessary
as file system use was continually over 90% resulting in no
space for running new jobs and potential degraded file system
performance (as file system housekeeping needs some free space
to operate effectively).
We anticipate that additional shared scratch space will be added to henry2 cluster during 2010.
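The purge policy described above (delete the oldest files until usage falls back under a threshold) can be sketched roughly as follows. This is only an illustration, not the tool actually used on henry2: the function names, the use of modification time as the measure of age, and the 90% target are assumptions made for the example.

```python
from pathlib import Path


def select_purge_candidates(files, used_bytes, capacity_bytes, target_fraction=0.90):
    """Pick the oldest files to delete until usage is at or below the target.

    files: iterable of (path, size_bytes, mtime) tuples.
    Returns the list of paths that would be deleted, oldest first.
    """
    target_bytes = capacity_bytes * target_fraction
    to_delete = []
    # Sort by modification time so the oldest files are considered first.
    for path, size, _mtime in sorted(files, key=lambda f: f[2]):
        if used_bytes <= target_bytes:
            break
        to_delete.append(path)
        used_bytes -= size
    return to_delete


def scan(root):
    """Yield (path, size, mtime) for every regular file under root."""
    for p in Path(root).rglob("*"):
        if p.is_file():
            st = p.stat()
            yield str(p), st.st_size, st.st_mtime
```

The selection step only returns a list and deletes nothing, so a candidate list could be reviewed before any files are removed.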
14 December 2009 sam cluster
-
During winter break additional compute node
resources are being allocated to the sam
Linux cluster. These are servers that primarily
deliver student applications via VCL during
academic semester to students at UNC institutions
and NC Community Colleges.
15 September 2009 HPC backups
-
Over this past weekend (September 12-13) the backup
server for HPC backups began experiencing stability
problems. Currently the server is down.
This means that files being created or modified are not currently being backed up and that we are not able to restore files from backup at this time.
We are working to get the server back online. In the meantime, HPC users should take extra care in working with files and maintain their own backup of critical files.
1 September 2009 /share
-
/share file system experienced errors and automatically
remounted itself read-only. Wednesday morning (2 September)
beginning at approximately 8am /share will be taken off
line to repair the file system. It is expected to be
available again by Thursday (3 September).
3 August 2009 henry2
-
About 7:30am the power connections for network switch for henry2
will be moved to provide improved redundancy. Power supplies in
the switch are redundant - so no service interruption is expected.
25 July 2009 /share4
-
During restore of some volume1 files, /share4 experienced
a file system error.
It has been unmounted from all nodes and currently a file system check is running to repair the file system.
Storage array for /share4 had experienced a disk failure. Disk has been replaced and array rebuilt and is again available from all cluster nodes.
19 July 2009 henry2
-
henry2 cluster will be rebooted between 00:00 and 01:00 on
Sunday 19 July. Any LSF jobs running at this time will be
lost.
Also, due to electrical work in the data center about half the henry2 nodes will not be available until potentially 18:00 (6pm) on Sunday 19 July. Nodes impacted by the power outage will include the regular login nodes.
Login access during the power outage will continue to be available by reserving HPC login node using the Virtual Computing Lab (http://vcl.ncsu.edu/).
9 July 2009 henry2
-
An LSF error resulted in many LSF jobs being lost from the system.
These jobs continue to run on the compute nodes, but are no longer
visible to LSF.
Many LSF queues were closed to allow the lost jobs to complete without LSF scheduling new jobs on top of them on the same compute nodes.
Working also to identify the cause of the LSF error, which occurred during a restart of the master batch daemon - a relatively frequent event that until recently has not exhibited bad side effects.
8 July 2009 /ncsu/volume1
-
File system has completely filled and is currently unusable.
Working to recover the file system.
22 June 2009 /ncsu/volume1
-
The mass storage file system /ncsu/volume1 will be migrated to new
server and storage hardware. To prepare for this migration the
file system is being remounted as read only. This will allow all
files on existing file system to be copied to the new file system
and verified.
Due to size of this file system it may take up to three weeks to complete the migration. Once the migration is complete the file system will be remounted read/write running from the new hardware.
15 May 2009 henry2
-
One of the computer room air conditioning (CRAC) units in the data center
where the henry2 cluster is located has partially failed. To reduce the
heat load in the data center a number of henry2 compute nodes have been
powered off. As jobs complete additional nodes will be powered off.
It is expected that the part needed to repair the CRAC unit will be available Monday and the unit repaired Monday. Until the unit is repaired queue wait times for henry2 will likely be longer as only a fraction of the compute nodes will be available to run jobs.
16 April 2009 /ncsu/volume1
-
File system has passed the safe capacity threshold,
but is unable to keep up with rate that data is being
moved onto the file system.
File system has been remounted read-only to allow sufficient free space to be recovered through file migration.
12 February 2009 /ncsu/volume1
-
File system completely filled - again causing serious
problems for the file system. Working to repair file system.
5 February 2009 p5login
-
Access to p5login was lost.
Access was restored 6 Feb. The system has a failing hardware component. System is no longer covered by maintenance - so if the component fails the system may or may not be repaired depending on cost.
6 January 2009 /home
-
Server for home file system is experiencing problems.
A new server is being prepared and is expected to be
installed Wednesday 7 January.
There are a number of issues related to the /home file server problems. Many compute nodes have lost the mount of /home resulting in these nodes being unreachable via ssh (and therefore not able to successfully run jobs).
LSF queues have been changed to inactive status to prevent jobs from starting and immediately failing from not being able to reach some compute nodes.
Update: the new file server for /home has been installed and the cluster returned to normal operation at about 11:15am Wednesday 7 January.
25 December 2008 /home
-
file server for /home file system was unreachable. File
server has been restarted and /home is again available.
-
Node p5compute02 unavailable. Remaining 3 p575
nodes continue in service.
Currently, identifying the p5compute02 issue is delayed while issues with an Ethernet switch serving the POWER5 service network are resolved.
8 December 2008 /ncsu/volume1
-
Mass storage file system /ncsu/volume1 again
reached 100% full and is again unavailable while
the file system recovers some available space.
File system is available again as of Wednesday, 10 December.
1 December 2008 /ncsu/volume2
-
Fiber Channel switch connecting disk array
for /ncsu/volume2 failed.
Power supply was replaced and file system returned to service 2 Dec.
14 November 2008 /ncsu/volume1
-
File system is again experiencing problems from
being too full.
3 November 2008 /ncsu/volume1
-
There are file system problems with /ncsu/volume1
possibly resulting from the file system filling to
100%.
File system has been repaired and remounted on login nodes.
27 October 2008 /share3
-
There are file system problems with /share3.
This file system will be taken off line to
run file system check. Expected that checking
the file system will take approximately one
day.
Large parallel jobs should use /gpfs_share instead of /share or /share3 (which are NFS mounted file systems and are not suitable for use from large parallel jobs where all processes read or write to disk).
29 September 2008 /ncsu/volume2
-
/ncsu/volume2 has been remounted as a read-only
file system in preparation for move to new disk hardware.
Following the backup on 30 September, the contents of volume2 will be restored from tape onto a new disk array. Once that is complete - it is estimated it will take about one week - the new disk and file server will replace the current /ncsu/volume2 and it will again be available for read and write operations.
14 September 2008 /gpfs_share
-
File system /gpfs_share is currently not responding. The
problem is being investigated.
Failed jobs on compute nodes still having open files on /gpfs_share appear to have stalled the file system. Compute nodes were rebooted and /gpfs_share appears to be working normally again.
8 September 2008 p5login
-
p5login.hpc.ncsu.edu will be briefly unavailable
Monday morning to change its public IP address.
28 August 2008 /ncsu/volume1
-
File system /ncsu/volume1 filled to 100% capacity.
File server became unresponsive.
The file server has been rebooted, however, the file system remains unavailable from login nodes while files are migrated to tape to restore free space in the file system.
We expect the file system to be available from login nodes by 5pm Friday 29 August.
29 June 2008 henry2
-
HPC henry2 cluster was unavailable as a result of
the data center power outage.
The power outage left the cluster in a very confused state with nearly all file systems in degraded states.
Access was restored to login nodes about 6:30pm and to LSF queues about 7:15pm. A set of compute nodes, including nearly all the 32-bit nodes remain unavailable. We will resume work on restoring these to service tomorrow.
We very much regret this interruption of HPC services.
13 May 2008 /ncsu/volume1
-
Mass storage file system /ncsu/volume1 is not currently
available. The file system filled and is currently being
repaired. All files appear to be intact, but the
hierarchical storage management software needs some time
to clean up the file system without any additional
writes. Expect the file system to be available again
late on Wednesday 14 May.
5 May 2008 Blade Center logins will be disabled.
-
Maintenance was performed on the BladeCenter cluster (henry2)
beginning at about 8am. System was available again
about 6pm. A number of file systems and services were migrated to
new servers during this maintenance and new login nodes were
installed.
Following the maintenance login.hpc.ncsu.edu connects to a set of login nodes running 32-bit Linux (as before, but with new quad-core nodes) and login64.hpc.ncsu.edu connects to a set of login nodes running 64-bit Linux (instead of a single node - also now quad-core nodes).
30 April 2008 VCL HPC nodes
-
VCL HPC nodes were down while a switch was reconfigured. They are
back up (Thursday, 1 May)
28 January 2008 GPFS
-
GPFS file systems will be unavailable from 8-10am to allow for
configuration changes associated with the increase in the
size of the /gpfs_share.
26 November 2007 /gpfs_share
-
In conjunction with the move of /home - since that will
effectively make the cluster unable to process jobs - /gpfs_share
will also be migrated to arrays with new disks. Following
this migration the capacity of /gpfs_share will increase
from its current 4TB to approximately 16TB. Per group quota
will remain unchanged for now.
25 November 2007 /home
-
/home file system will be moved to a disk array with new disk
drives. This move is in response to the drive failures that
occurred in mid-September.
During the move /home will be unavailable on both the Linux cluster and the POWER5 system.
In preparation for the move LSF queues will stop scheduling new jobs late Saturday November 24. Sunday November 25 the current /home will be unmounted and a final backup done. The new /home will then be mounted and the contents restored from backup.
It is expected that /home will be available again by noon on Monday November 26.
1 November thru 5 November 2007 /ncsu/volume1
-
The tape library which serves as the 2nd tier storage
for /ncsu/volume1 will be upgraded between 2-6pm on
Thursday November 1. During this time, files which
have been migrated to tape will not be available.
The expansion was not successful. The library is operational and is able to complete daily backups. However it is not able to access new tapes.
/ncsu/volume1 remains off line. The file system is over full and the HSM software is working to free disk space and perform maintenance on the file system.
Following firmware upgrade Friday Nov 2 hardware issues were identified. New gripper and scanner have been ordered and should be installed on Monday Nov 5.
/ncsu/volume1 is again available from HPC login nodes.
30 October 2007 login.hpc.ncsu.edu
-
Resolution of the name login.hpc.ncsu.edu changed from a load balancer to round robin DNS. Access will continue to be distributed between two 32-bit login nodes for the BladeCenter Linux Cluster (login01 and login02). Only the mechanism for this distribution was altered.
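As a rough illustration of how round robin DNS spreads logins across nodes, the sketch below models a name whose address list is rotated on each lookup, so clients that take the first answer alternate between nodes. The class name and the rotate-per-query behaviour are assumptions for the example; real DNS servers and resolvers may order or cache answers differently.

```python
from collections import deque


class RoundRobinName:
    """Toy model of a DNS name backed by several address records."""

    def __init__(self, addresses):
        self._records = deque(addresses)

    def resolve(self):
        # Return all records, then rotate so the next query sees a new first entry.
        answer = list(self._records)
        self._records.rotate(-1)
        return answer


name = RoundRobinName(["login01", "login02"])
# A client that always uses the first record alternates between the two nodes.
first_choices = [name.resolve()[0] for _ in range(4)]
```

Unlike a load balancer, this mechanism does not track node health or current load; it only varies the order of the answers.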
28 September 2007 /ncsu/volume1
-
/ncsu/volume1 file system completely filled overnight.
Due to the nature of this file system (hierarchically
managed) some amount of free disk space is essential
for it to operate. Currently the file system is
unusable and has been unmounted.
Some files will have to be removed from the file system before it will be usable again. If you have any large files on /ncsu/volume1 that you also have a copy of on your own system, and that we could remove from /ncsu/volume1, please let us know.
13 September 2007 /home
-
/home file system is again off line. IO errors
caused the file server to remount the file system
read only. To correct the errors the file system
had to be taken off line. File system check is
being done to try to correct any file system errors.
11 September 2007 /home
-
Storage array holding the /home file system experienced
a disk failure - the disk was replaced,
however, before the RAID data recovery was complete a
second disk failed. The second disk failure resulted
in loss of data. The file system contents are being
restored from tape backup to a different disk array.
Update - 13 September
The restore from tape took longer than expected due
to a very large number of small files in /home.
/home was back online at about 9:30pm 12 September.
Between 10 September and 12 September we have experienced four disk failures in the storage arrays which contain most of the HPC file systems (/home, /share, /share3, /share4, and /gpfs_share). We will be ordering new disk drives (current drives are four years old and out of warranty) and migrating these file systems to the new disks over the next few weeks.
6 September 2007 ANSYS and CFX
-
Licenses for ANSYS and CFX (now owned by ANSYS) have been
renewed. ANSYS has imposed a new license term which must
be individually accepted before access can be granted to
ANSYS or CFX. Visit the software link from the HPC home
page (http://hpc.ncsu.edu/) and then select the button
by ANSYS and CFX to request access.
ANSYS license also allows only a single version of ANSYS to be in use on campus. That version is now version 11. Default CFX version is also being changed to version 11 to be consistent with the ANSYS version.
Also only 64-bit versions of ANSYS and CFX are currently available. These versions will not run on the default login nodes. Use login03 to access the GUIs for ANSYS or CFX and add "-R em64t" option on bsub commands for jobs using ANSYS or CFX to ensure the job is scheduled on a 64-bit compute node.
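As a minimal sketch of the submission advice above, the helper below assembles a bsub command line that includes the "-R em64t" resource requirement. The function name, the job script name, and the task count are hypothetical; "-n" (number of tasks) and "-o" (output file, with %J replaced by the job ID) are standard bsub options, but the exact options appropriate for a given ANSYS or CFX job are not specified here.

```python
import shlex


def ansys_bsub_command(job_command, ntasks=1, output_file="out.%J"):
    """Build a bsub argument list that requests a 64-bit (em64t) compute node."""
    args = ["bsub", "-n", str(ntasks), "-R", "em64t", "-o", output_file]
    # Split the user's job command into separate arguments.
    args += shlex.split(job_command)
    return args


# Hypothetical job script and input file, for illustration only.
cmd = ansys_bsub_command("./run_case.sh input.def", ntasks=4)
print(" ".join(cmd))
```

Building the command as a list keeps the resource requirement in one place, so it is not forgotten when other options change.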
2 September 2007 /home
-
File server for /home was down. This resulted in login
attempts on henry2 cluster hanging and logins on the POWER5
system receiving an error about missing /home file system.
File server for /home was restarted.
27 August 2007 POWER5 System
-
The POWER5 system will be unavailable due to required hardware
maintenance beginning at 8am. The system was expected to be
available again by 2pm, however, the hardware maintenance took
longer than originally expected. System was available for users
again as of 5pm.
11 August 2007 henry2
-
henry2 login nodes were again becoming unresponsive.
Server for /usr/local/apps was restarted.
9 August 2007 henry2
-
Linux cluster (henry2) 32-bit login nodes (login.hpc.ncsu.edu)
were unreachable. This appears to have been the result of many
large memory jobs being run at the same time on the login nodes.
Please do not use login nodes for running jobs. Any
resource intensive tasks should be submitted to LSF.
31 July 2007 POWER5
-
Both university owned nodes of the POWER5 system were
inoperable. General queues on the POWER5
were inactivated until at least one of the university
owned nodes could be returned to service.
As of approximately 6:30pm on 2 August 2007 all POWER5 nodes had been returned to service.
14 July 2007 CFX
-
The license for CFX expired July 13. We are working on
renewing the license. However, no estimate is currently
available for when (or even if) the renewal process
will be completed.
17 May 2007 /ncsu/volume1
-
/ncsu/volume1 is reporting disk errors and has been
unmounted from all login nodes while the problem is
evaluated.
24 April 2007 /gpfs_share
-
/gpfs_share file system crashed at about 9:30am. The
gpfs_share file servers have been brought back online
and the file system is available again on the servers.
gpfs is being restarted on all henry2 login and
compute nodes.
10 April 2007 Linux clusters
-
Overnight the network connections for message passing
traffic between nodes were lost - causing parallel
jobs running across multiple chassis to end and new
jobs attempting to start across multiple chassis to
fail.
9 April 2007 /home file server
-
OS update for the /home file server was not able to
be completed in parallel with the network switch work.
Cluster remained unavailable until about 10am while
the /home file server was updated to the same Linux
version as the cluster login and compute nodes.
We regret this extended down time - but feel it was important to
get the file server OS updated.
As a side effect of the OS update, quota information for /home was lost. Quotas will be reset - at a value higher than current use - but not necessarily at the same level as previously.
9 April 2007 Linux clusters down
-
From 6-8AM on Monday April 9th the cluster will be down
to update the core ethernet switch.
It is likely that any jobs running at that time will be lost since network connections to storage and between chassis will be disconnected.
To minimize lost work, queues will be paused Sunday evening (April 8th) to allow as many jobs as possible to complete prior to the network work Monday morning.
21 February 2007 /share
-
/share file system is again available read/write from
all henry2 nodes. The file server for /share has been
replaced. Also the mount options for this file system
are now identical to /share3. MPI-IO jobs should use
/gpfs_share. MPI-IO will no longer work reliably on
/share - however, performance of /share should now
be as good or better than /share3.
21 February 2007 henry2 login nodes
-
henry2 login nodes (login01 and login02) will be
replaced with newer servers (current servers are
no longer under maintenance).
The transition will happen in two steps. The first step occurred on Monday 19 February. login01 was replaced by a newer server running a more recent version of Linux. Please report any issues observed using the new login01.
Once all issues are resolved with the new login01 node - login02 will also be replaced.
2 February 2007 /home file system
-
Responsiveness of /home file system deteriorated over
night until this morning logins became completely
impossible. Server for /home was rebooted along with
both login nodes (which appeared to be the source
of the /home problem).
29 January 2007 HPC core network switch
-
The core HPC network switch will be rebooted at 8am on
Monday January 29. The reboot will take about 5 minutes.
During the reboot connections to HPC cluster login nodes,
from HPC nodes to HPC storage, and for interchassis MPI
communications will be unavailable.
HPC jobs attempting to access HPC storage or communicate between chassis during this time will fail.
To reduce the impact of the reboot, cluster queues will not start new jobs after about noon on Sunday 28 January.
We regret this inconvenience, but the reboot is necessary to apply a security update on the switch. Monday morning was chosen as the time for the upgrade because that is typically the time there are the fewest jobs running on the cluster.
23-December-2006 /gpfs_share
-
Following the data center power failure, one
of four disk arrays used by /gpfs_share failed
to recover.
/gpfs_share was fully operational again about 2pm Dec 24.
20-November-2006 /ncsu/volume1
-
There are more than 250,000 migrated files in
/ncsu/volume1 file system. After applying a fix
provided by the file system vendor all but about
14 files have been repaired - all of these belong
to a single user.
/ncsu/volume1 is again available read/write from HPC login nodes.
We very much regret this extensive period of being unable to write to this important file system. We are working with the vendor to develop procedures to minimize the chance of future disruptions.
4-October-2006 /ncsu/volume1
-
Migrated files on /ncsu/volume1 were not being
recalled on demand. File system is available
read only from login03.hpc.ncsu.edu while we
work with the vendor to correct the problem with
migrated files.
9-September-2006 /ncsu/volume1
-
A file system error occurred on /ncsu/volume1. The
file system was taken offline and repaired. File
system was off line from about 2pm Friday until
about 7:30am Saturday.
24-July-2006 /share and /share3 on henry2
-
File server for /share crashed about 3am and was
returned to service about 8am.
Several compute nodes remained in a busy state trying to access /share3. About 9:30am the server for /share3 was rebooted to free the hanging compute nodes.
24-June-2006 henry2 Linux Cluster
-
Following a power outage Saturday afternoon
cooling for the HPC henry2 Linux Cluster was
lost. Cluster compute nodes were powered
down to minimize damage from the resulting
high temperatures. LSF jobs running on henry2
at 4pm Saturday were lost.
Please carefully check results from Saturday jobs to be sure they completed correctly.
Also as a result of the loss of cooling the server for /share3 failed. /share3 was returned to service about 3pm Sunday 25 June.
11-May-2006 /ncsu/volume1
-
Server for /ncsu/volume1 has again become unstable.
Update: 17-May Some file system and NFS settings have been adjusted. File system is currently mounted from login01 only.
25-Apr-2006 /ncsu/volume1
-
/ncsu/volume1 is unavailable. Server
for this file system crashed.
Update: 5-May Server hardware has been replaced and software reloaded. File system is currently reconciling with the HSM database. File system is available again as of about 3pm 5 May.
1-Mar-2006 /ncsu/volume1
-
/ncsu/volume1 is now available from new
server and disk space for reading and
writing. This file system is now managed
by a hierarchical storage manager that
will migrate old, large files to tape.
Any access of migrated files will
restore them to disk - with some delay
as the tape is loaded and read.
15-Feb-2006 Storage News
-
Two storage enhancements are underway on
HPC systems.
Henry2 GPFS - A GPFS (general parallel file system) instance is being deployed on the henry2 Linux cluster. This is the same type of file system that is used for shared scratch space on the POWER5 system. The cluster implementation currently uses two servers and has about 2TB capacity.
Testing so far has shown about 3X better performance than the best performance seen with NFS shared file systems (eg /share3).
If testing continues to go well, disk resources currently allocated to /test_share will be redeployed to gpfs along with another TB of disk to provide 6TB of gpfs space this spring. Eventually we expect that /share and /share3 disks will also be reallocated to gpfs to provide 8TB of gpfs space. Target for this transition is during Spring 2006 exams.
Group quotas will be enforced on the GPFS file system. Currently the group quota is 1TB. Also, like other shared scratch file systems the gpfs space will not be backed up and will be subject to a periodic purge to maintain free space.
Mass Storage - Mass storage volume1 is in the process of being migrated to a new server. After migration this file system will be managed with Tivoli Space Manager. This will allow large files which have not been recently accessed to be stored on tape rather than disk, thereby making additional storage space available in the /ncsu/volume1 file system. Actual additional space will depend on compression ratio achieved in storing files to tape, but it is estimated that the current tape library capacity will provide an additional 10TB of mass storage space.
This will increase mass storage space to approximately 26TB from the current 16TB and is expected to be in operation by the end of February.
6-Jan-2006 Major Maintenance Window
-
On Friday January 6 a major maintenance window
will be taken to make significant adjustments
to HPC systems:
Also, the mass storage directories (/ncsu/volume1 and /ncsu/volume2) will not be available from the HPC Linux clusters (henry2 and tim) from 6am Friday until 6pm Friday.
28-Dec-2005 /ncsu/volume[12] read-only
-
/ncsu/volume1 and /ncsu/volume2 will be read-only
from Wed Dec 28 through approximately Sat Dec 31
to allow the migration of /ncsu/volume1 to a
slightly larger file system.
5-Dec-2005 henry2 network outage
-
There will be a brief network outage for the
henry2 cluster Monday Dec 5 about 7:30am.
This outage is to allow the switch serving
the henry2 cluster to be upgraded. Outage
is expected to last about 10 minutes.
13-Nov-2005 /share file system on henry2
-
The /share file system will be unavailable for
10-15 minutes between 8pm and 8:30pm on Sunday
evening. This down time is needed to allow for
maintenance on the disk array serving /share.
LSF jobs running from /share could abort when /share is taken off line. Users planning to run jobs over the weekend may want to run from /share3 instead of /share.
7-Nov-2005 Power5 system
-
Power5 system network connections to internal
HPC network are down. This is resulting in
/home, /usr/local/apps, /ncsu/volume1, and
/ncsu/volume2 being unavailable. Comtech has
been notified of the problem.
6-Nov-2005 henry2 cluster
-
File server for /share file system had hung and had to be rebooted. LSF jobs running from /share were lost.
3-Nov-2005 henry2 cluster and power5 system
-
File server for /home file system had to be rebooted
to clear lots of hanging processes on login nodes.
22-Oct-2005 henry2 cluster
-
File system on management node holding LSF filled
overnight. This caused LSF to stop.
LSF was moved to a new, larger file system.
Jobs submitted after the old file system filled would have been lost.
Please send email if any problems or unusual behavior are observed with LSF on the cluster.
06-Oct-2005 /ncsu/volume[12] mass storage
-
Mass storage file systems are again available from
HPC login nodes.
During the next couple months the mass storage file systems will be migrating from a single server to a server for each file system. During this transition there will be some periods of time that the file systems will be read-only. These read-only periods will be announced in News, Sysnews, and login banners.
Once multiple servers are in place the mass storage system will be much less likely to be offline from a single component failure.
05-Oct-2005 /ncsu/volume[12] mass storage
-
Server for /ncsu/volume[12] is not responding.
Server has had a hardware failure, it is being repaired.
03-Oct-2005 LSF licenses
-
There are ongoing issues with LSF licenses
on the power5 system.
At this time the power5 system is being returned to friendly user mode due to issues with batch processing.
We are working with the LSF vendor to resolve these issues as quickly as possible.
We very much regret the inconvenience this license problem is causing power5 users.
23-Sept-2005 LSF licenses
-
Renewal LSF licenses were installed yesterday.
This morning there were problems with LSF having
the correct licenses for scheduling parallel jobs.
The version of LSF running on the henry2 Linux cluster was updated from 5.1 to 6.1. Users should log off and back on the cluster before submitting jobs to ensure that their environment is correctly configured for the new LSF version.
19-Sept-2005 power5 system
-
As of Monday September 19 the power5 system is
in production operation.
The system supports large memory (up to 32GB of physical memory) jobs using up to 8 processors.
Fortran, C, and C++ compilers are available to build user applications. MPI or OpenMP parallelization are supported on the system.
For more information regarding the power5 system see the power5 "How to" page: http://hpc.ncsu.edu/Documents/SharedMemory/GettingStartedp5.php
12-Sept-2005 login.hpc.ncsu.edu
-
The load balancer for login.hpc.ncsu.edu will be changed
at midnight Monday 12 September. Open sessions will be
dropped. However, new ssh sessions should be immediately
available through the new load balancer.
31-Aug-2005 IBM p575
-
The power5 system will be unavailable on Thursday
September 1. The system is being relocated in the
data center following the removal of the p690
system.
It is expected that the power5 system will be available again late Thursday afternoon.
1-Aug-2005 IBM p575
7-July-2005 henry2 software updates
1-July-2005 cluster /home file system
-
Since the operating system update on the
henry2 cluster there have been a number of
software failures on the server for the
/home file system. Efforts to identify the
cause of these failures have not been
successful. On June 30 the server experienced
three failures. Following the second June 30
failure a new server was configured and
migration of /home to the new server began.
Following the third June 30 failure the migration
to the new server was completed.
Jobs using files on /home may have encountered problems June 30 due to the number of failures and extended period of the third outage as the transfer to the new server was completed.
The new server has twice as much physical memory as the previous server and is running the same Linux distribution and kernel as the cluster nodes (whereas before the server was running a different kernel).
We will continue to closely monitor the server for /home and regret the inconvenience the previous failures have caused.
1-July-2005 IBM p690
-
Production use of IBM p690 ended June 30, 2005.
A replacement system based on POWER5 processors
has been delivered and is expected to be installed
within the next few days. During the transition
to the new system the p690 will remain available,
however, it is no longer under maintenance so
any hardware failures may not be repaired.
Timeline for friendly user access to the new system will be posted once installation is complete.
Output from any jobs run on the p690 during this transition period should be copied off as soon as possible - keeping in mind that the system is no longer under maintenance.
6-June-2005 IBM p690 Replacement
-
The IBM p690 which NC State has operated for the
past two years will be replaced with a new
shared memory computing system. The new system
will be installed in mid-June and it is planned
to retire the p690 at the end of July.
NC State has been paying the annual hardware maintenance costs for the p690. The replacement system has been acquired for approximately the amount that would have been spent renewing the p690 maintenance for another year.
Existing p690 hardware maintenance expires at the end of June. While it is planned to continue operating the p690 through July, any hardware failure during this time would likely not be repaired. Users should be careful to get data off the p690 promptly when runs complete.
The new shared memory system will have two IBM p575 compute nodes each with 8 1.9GHz single-core POWER5 processors and 32GB of memory; an IBM p550 login node with two 1.65GHz dual-core POWER5 processors and 8 GB memory; and 2TB shared scratch space available to all three nodes using IBM's general parallel file system (gpfs).
2-June-2005 henry2 /home file system
-
The /home file system for henry2 cluster is currently
off line. This is causing login attempts to hang.
Working with server to identify cause of this recurring issue. /home should be back online by 9am.
24-May-2005 henry2 /home file system
-
The /home file system for henry2 cluster is currently
off line. This is causing login attempts to hang.
Server for /home rebooted and normal operation has been restored.
12-May-2005 Intel Compilers - Henry2 Linux Cluster
-
The default Intel compiler version has been
updated from 7.1 to 8.1
This affects the compiler version obtained using the 'add intel' command.
The 8.1 Intel compilers are invoked with different commands than 7.1 - Fortran is ifort and C++ is icpc
7.1 compilers remain available. Use command
source /usr/local/intel/compiler70/ia32/bin/ifcvars.csh
to access 7.1 instead of 8.1
8-May-2005 Henry2 Linux Cluster
-
Henry2 linux cluster login nodes were not responding.
File server for home file system was down. The
server was upgraded and returned to service. Login
nodes available as of 10:30am
26-Mar-2005 IBM p690 (mcrae)
-
IBM p690 (mcrae) went down about 12:30 Saturday
afternoon. System was rebooted and returned to
service around 4:30pm. About 9:30pm the system
crashed again.
Access to the p690 was restored about 4pm on Monday (28 March). LSF jobs that were running at the times the system crashed were lost. Jobs waiting in LSF queue were not affected.
8-Mar-2005 /ncsu/volume2
-
Move of SMS completed and production version of
/ncsu/volume2 is again available read/write for
users with space on that file system.
HPC very much regrets the short notice that was provided for this service outage.
7-Mar-2005 /ncsu/volume2
-
Half of the university storage management system (SMS) is
being relocated. While this part of the SMS is offline,
/ncsu/volume2 will be available read only from
the backup version.
Users may find some files in the backup that they previously deleted.
It is expected that the SMS will be back online by Wednesday March 9.
3-Mar-2005 p690 (mcrae) LSF ok
-
A combination of network link updates and license changes
to/at MCNC has caused instability in both LSF and mpiexec.
All issues appear to have been resolved.
24-Feb-2005 p690 (mcrae) LSF Down
-
IBM p690 (mcrae) has lost connection to LSF license server.
LSF is down. Running and queued jobs should not be affected.
New jobs are not being accepted, nor are new jobs being started.
Working to identify cause for loss of connection. Basic LSF
operation was restored before 5pm.
18-Feb-2005 /ncsu/volume2 Unavailable
-
From about 9pm Friday 18-Feb /ncsu/volume2 will be unavailable
due to maintenance on the university storage management
system. /ncsu/volume2 will be back online by 8am Saturday 19-Feb.
16-Feb-2005 /ncsu/volume2 and /ncsu/volume1 Unavailable from Clusters
-
/ncsu/volume1 and /ncsu/volume2 have been intermittently unavailable
from the cluster today. Access from the p690 has not been affected.
Problem was resolved in network by late afternoon.
12-Feb-2005 /ncsu/volume2 Unavailable
-
From about 9pm Friday 11-Feb /ncsu/volume2 was unavailable
due to maintenance on the university storage management
system. /ncsu/volume2 was back online before 8am Saturday 12-Feb.
2-Feb-2005 /share Unavailable
-
Thursday Feb 3, /share file system will be unavailable
briefly about 8am. The server for /share file system
will be rebooted to bring online additional storage.
19-Jan-2005 Clusters find a new home
HPC Xeon cluster (henry2) and the Opteron cluster (Tim)
have moved to the new Computer Disaster Relief machine
room... Cooool space!!!
16-Jan-2005 Cluster Move Update
HPC Xeon cluster (henry2) was returned to service around 6pm on Sunday Jan 16. LSF queues were restarted a few hours earlier.
Only one login node is currently available for henry2 cluster. Second login node should be available again Tuesday.
Opteron test cluster is expected to be available for use again by end of day Tuesday Jan 18.
13-Jan-2005 /ncsu/volume[12] File Systems
Users will have access to /ncsu/volume[12] during the cluster move starting Friday January 14 (but will need a mcrae account).
28-Dec-2004 University Linux Clusters Moving
-
The university Linux Clusters (henry2 and tim) will
be moved to the new data center. Currently it is
expected the move will begin Friday January 14 and
be completed by Tuesday January 18.
In preparation for the move LSF queues will stop starting new jobs around noon of Thursday January 13 (except for debug queue). Queued jobs that have not started should requeue after the move without problems. Jobs running when the cluster goes down for the move will likely be lost.
Data stored on /ncsu/volume[12] will continue to be available from mcrae.hpc.ncsu.edu.
15-Dec-2004 Opteron Test Cluster
-
A small (4 compute node + 1 interactive node) AMD Opteron
cluster (tim) is now available for testing by friendly users.
The cluster uses IBM e325 dual Opteron servers with 2GHz
processors. The compute nodes each have 9 GB of memory.
Opterons will run x86 binaries and the Portland Group x86-64 compilers are available to develop 64-bit executables.
Contact eric_sills@ncsu.edu if interested in being a friendly user of the Opteron cluster.
10-Dec-2004 henry2 reboot
-
The cluster head node will be rebooted Friday morning to
attempt to clear some issues being observed with file systems.
During the reboot access to /home and /usr/local file systems
will be lost.
25-Oct-2004 henry2 login nodes
-
Login sessions via ssh to the henry2 cluster should
be to login.hpc.ncsu.edu
This will direct the login session to one of the two current login nodes. This avoids a potential single point of failure for the cluster and also permits easy expansion of login nodes if needed to support future use.
Access to henry2.hpc.ncsu.edu has been restricted.
19-Oct-2004 Mass Storage Offline
-
/ncsu/volume1 and /ncsu/volume2 became unreachable from
henry2 around 3:30pm. By 4:30pm these file systems were
also unreachable from mcrae. File server was rebooted and
connectivity restored around 5:30pm.
18-Oct-2004 Perros appointed to NLR Network Research Council
-
Dr. Harry Perros (NC State, Computer Science) has been appointed to the
National Lambda Rail (www.nlr.org) Network Research Council (NLR NRC).
A significant portion of the NLR facilities are to be devoted to research in networking. The NLR NRC will both provide guidance to the Board of NLR and inform the networking community of this opportunity. The NLR NRC is also to provide input on the critical research issues that can utilize the advanced capabilities of the NLR network.
Members are:
- Paul Barford, University of Wisconsin-Madison
- Dan Blumenthal, University of California, Santa Barbara
- Javad Boroumand, Cisco Systems
- Hank Dardy, Naval Research Laboratory
- Constantinos Dovrolis, Georgia Tech
- David Farber, Carnegie Mellon University (chair)
- Gerald Faulhaber, University of Pennsylvania
- Paul Francis, Cornell University
- Larry Landweber, University of Wisconsin-Madison and Internet2 (ex officio)
- Jason Leigh, University of Illinois-Chicago
- Steven Low, Caltech
- Mike O'Dell, unaffiliated
- Phil Papadopoulos, University of California, San Diego
- Craig Partridge, BBN Technologies
- Guru Parulkar, National Science Foundation
- Harry Perros, North Carolina State University
14-Oct-2004 Thom Dunning to lead NCSA
12-Oct-2004 ORNL Positions
-
Oak Ridge National Laboratory has computational
positions available. See
http://computing.ornl.gov/Employment/ for
information.
11-Oct-2004 - mcrae (IBM p690) response slow
-
On Thursday 30 Sept response from IBM p690 (mcrae) became
very slow and system was rebooted. Unfortunately, LSF jobs were
lost during the reboot.
Following the reboot, LSF MPI jobs failed with either license or PJL errors, until configuration changes were made on Monday 3 October.
p690 again began to display very slow response on Saturday 9 October. System was rebooted Sunday 10 October. All running LSF jobs had completed prior to the reboot.
29-Sep-2004 - henry2 again available
- ssh connectivity to henry2 was lost from approximately 8pm Friday 24 Sept until 10:30am Saturday 25 Sept.
- ssh connectivity was again lost from approximately 12:30-1:30pm on Monday 27 Sept.
- ssh connectivity was again lost from approximately 11pm-midnight on Wednesday 29 Sept.
-
LSF jobs continued to run on compute nodes. Jobs
using /home during the above times should be examined
closely for any problems.
23-Sep-2004 - Parallel Jobs on Henry2 Cluster
-
LSF job scripts that explicitly invoke 'pam'
should revert to using 'mpiexec' command to
execute MPICH jobs.
Due to upcoming LSF license expiration and renewal, there may be periods during which pam will fail with an error message saying 'Node not licensed'. The 'mpiexec' command will be altered as needed during the license transition to utilize the best MPICH execution mechanism available.
6-Sep-2004 - IBM Announces "Open Architecture" for Blades
- Shades of things to come [more ...]
6-Sep-2004 - NC State Virtual Laboratory
- Virtual Computing Laboratory is here [more ...]
6-Sep-2004 - Conferences
- ITD Booth at EdTech - come and visit us [more ...]
- UNC CAUSE - HPC will present [more ...]
- Supercomputing 2004 [more ...]
6-Sep-2004 - Processors added to cluster
-
The Henry2 cluster now has a total of 208 processors.
6-Sep-2004 - HPC and Grid Courses
-
HPC Group is involved in teaching two graduate and one
undergraduate course related to HPC and Grid computing.
11-August-2004 - henry2
-
henry2 was not accepting logins - and has been rebooted.
ALL JOBS MUST BE RUN THROUGH LSF. JOBS RUNNING ON HENRY2 WILL BE KILLED WITHOUT WARNING
Last modified: April 09 2012 17:18:55.