Accessibility statement: we seek to make the HPC web pages accessible to all users. If you encounter accessibility issues with HPC web pages please send a description of the problem by email to eric_sills@ncsu.edu - thank you.

High Performance and Grid Computing
   
 Operations News ... Last modified: September 04 2008 15:29:00.

8 September 2008 p5login

    p5login.hpc.ncsu.edu will be briefly unavailable Monday morning to change its public IP address.

28 August 2008 /ncsu/volume1

    File system /ncsu/volume1 filled to 100% capacity. File server became unresponsive.

    The file server has been rebooted; however, the file system remains unavailable from login nodes while files are migrated to tape to restore free space in the file system.

    We expect the file system to be available from login nodes by 5pm Friday 29 August.

29 June 2008 henry2

    HPC henry2 cluster was unavailable as a result of the data center power outage.

    The power outage left the cluster in a very confused state with nearly all file systems in degraded states.

    Access was restored to login nodes about 6:30pm and to LSF queues about 7:15pm. A set of compute nodes, including nearly all the 32-bit nodes, remains unavailable. We will resume work on restoring these to service tomorrow.

    We very much regret this interruption of HPC services.

13 May 2008 /ncsu/volume1

    Mass storage file system /ncsu/volume1 is not currently available. The file system filled and is currently being repaired. All files appear to be intact, but the hierarchical storage management software needs some time to clean up the file system without any additional writes. Expect the file system to be available again late on Wednesday 14 May.

5 May 2008 Blade Center logins will be disabled.

    Maintenance was performed on the BladeCenter cluster (henry2) beginning at about 8am. System was available again about 6pm. A number of file systems and services were migrated to new servers during this maintenance and new login nodes were installed.

    Following the maintenance, login.hpc.ncsu.edu connects to a set of login nodes running 32-bit Linux (as before, but with new quad-core nodes) and login64.hpc.ncsu.edu connects to a set of login nodes running 64-bit Linux (instead of a single node - also now quad-core nodes).

30 April 2008 VCL HPC nodes

    VCL HPC nodes were down while a switch was reconfigured. They are back up (Thursday, 1 May).

28 January 2008 GPFS

    GPFS file systems will be unavailable from 8-10am to allow for configuration changes associated with the increase in the size of the /gpfs_share.

26 November 2007 /gpfs_share

    In conjunction with the move of /home - since that will effectively make the cluster unable to process jobs - /gpfs_share will also be migrated to arrays with new disks. Following this migration the capacity of /gpfs_share will increase from its current 4TB to approximately 16TB. Per group quota will remain unchanged for now.

25 November 2007 /home

    /home file system will be moved to a disk array with new disk drives. This move is in response to the drive failures that occurred in mid-September.

    During the move /home will be unavailable on both the Linux cluster and the POWER5 system.

    In preparation for the move LSF queues will stop scheduling new jobs late Saturday November 24. Sunday November 25 the current /home will be unmounted and a final backup done. The new /home will then be mounted and the contents restored from backup.

    It is expected that /home will be available again by noon on Monday November 26.

1 November thru 5 November 2007 /ncsu/volume1

    The tape library which serves as the 2nd tier storage for /ncsu/volume1 will be upgraded between 2-6pm on Thursday November 1. During this time, files which have been migrated to tape will not be available.

    The expansion was not successful. The library is operational and is able to complete daily backups. However it is not able to access new tapes.

    /ncsu/volume1 remains off line. The file system is over full and the HSM software is working to free disk space and perform maintenance on the file system.

    Following firmware upgrade Friday Nov 2 hardware issues were identified. New gripper and scanner have been ordered and should be installed on Monday Nov 5.

    /ncsu/volume1 is again available from HPC login nodes.

30 October 2007 login.hpc.ncsu.edu

    Resolution of the name login.hpc.ncsu.edu changed from a load balancer to round robin DNS. Access will continue to be distributed between two 32-bit login nodes for the BladeCenter Linux Cluster (login01 and login02). Only the mechanism for this distribution was altered.

28 September 2007 /ncsu/volume1

    /ncsu/volume1 file system completely filled overnight. Due to the nature of this file system (hierarchically managed) some amount of free disk space is essential for it to operate. Currently the file system is unusable and has been unmounted.

    Some files will have to be removed from the file system before it will be usable again. If you have any large files on /ncsu/volume1 that are also stored on your own system, and that we could therefore remove from /ncsu/volume1, please let us know.

13 September 2007 /home

    /home file system is again off line. IO errors caused the file server to remount the file system read only. To correct the errors the file system had to be taken off line. File system check is being done to try to correct any file system errors.

11 September 2007 /home

    Storage array holding the /home file system experienced a disk failure. The disk was replaced; however, before the RAID data recovery was complete, a second disk failed. The second disk failure resulted in loss of data. The file system contents are being restored from tape backup to a different disk array.

    Update - 13 September
    The restore from tape took longer than expected due to a very large number of small files in /home. /home was back online at about 9:30pm 12 September.

    Between 10 September and 12 September we have experienced four disk failures in the storage arrays which contain most of the HPC file systems (/home, /share, /share3, /share4, and /gpfs_share). We will be ordering new disk drives (current drives are four years old and out of warranty) and migrating these file systems to the new disks over the next few weeks.

6 September 2007 ANSYS and CFX

    Licenses for ANSYS and CFX (now owned by ANSYS) have been renewed. ANSYS has imposed a new license term which must be individually accepted before access can be granted to ANSYS or CFX. Visit the software link from the HPC home page (http://hpc.ncsu.edu/) and then select the button by ANSYS and CFX to request access.

    ANSYS license also allows only a single version of ANSYS to be in use on campus. That version is now version 11. Default CFX version is also being changed to version 11 to be consistent with the ANSYS version.

    Also only 64-bit versions of ANSYS and CFX are currently available. These versions will not run on the default login nodes. Use login03 to access the GUIs for ANSYS or CFX and add "-R em64t" option on bsub commands for jobs using ANSYS or CFX to ensure the job is scheduled on a 64-bit compute node.
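    As a minimal sketch of the bsub advice above, the helper below places the "-R em64t" resource requirement in front of whatever other bsub arguments are supplied. The function name bsub_em64t_cmd, the job script ./run_cfx.sh, and the -n and -o values are hypothetical placeholders, not taken from the HPC documentation. The helper only prints the command line so it can be inspected; nothing is submitted. It is written for a Bourne-style shell.

```shell
# Hypothetical helper: print a bsub command line that carries the
# "-R em64t" resource requirement described above, so that the job is
# scheduled on a 64-bit compute node. Nothing is submitted here.
bsub_em64t_cmd() {
    printf 'bsub -R em64t'
    for arg in "$@"; do
        printf ' %s' "$arg"
    done
    printf '\n'
}

# Example: a 4-task job whose output file is named after the job ID (%J).
# The script name and option values are placeholders.
bsub_em64t_cmd -n 4 -o out.%J ./run_cfx.sh
# prints: bsub -R em64t -n 4 -o out.%J ./run_cfx.sh
```

    On a system where LSF is installed, the printed line can be reviewed and then run as an ordinary bsub command.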

2 September 2007 /home

    File server for /home was down. This resulted in login attempts on henry2 cluster hanging and logins on the POWER5 system receiving an error about missing /home file system. File server for /home was restarted.

27 August 2007 POWER5 System

    The POWER5 system was unavailable due to required hardware maintenance beginning at 8am. The system was expected to be available again by 2pm; however, the hardware maintenance took longer than originally expected. System was available for users again as of 5pm.

11 August 2007 henry2

    henry2 login nodes were again becoming unresponsive. Server for /usr/local/apps was restarted.

9 August 2007 henry2

    Linux cluster (henry2) 32-bit login nodes (login.hpc.ncsu.edu) were unreachable. This appears to have been the result of many large memory jobs being run at the same time on the login nodes. Please do not use login nodes for running jobs. Any resource intensive tasks should be submitted to LSF.

31 July 2007 POWER5

    Both university owned nodes of the POWER5 system were inoperable. General queues on the POWER5 were inactivated until at least one of the university owned nodes could be returned to service.

    As of approximately 6:30pm on 2 August 2007 all POWER5 nodes had been returned to service.

14 July 2007 CFX

    The license for CFX expired July 13. We are working on renewing the license. However, no estimate is currently available for when (or even if) the renewal process will be completed.

17 May 2007 /ncsu/volume1

    /ncsu/volume1 is reporting disk errors and has been unmounted from all login nodes while the problem is evaluated.

24 April 2007 /gpfs_share

    /gpfs_share file system crashed at about 9:30am. The gpfs_share file servers have been brought back online and the file system is available again on the servers. gpfs is being restarted on all henry2 login and compute nodes.

10 April 2007 Linux clusters

    Overnight the network connections for message passing traffic between nodes were lost - causing parallel jobs running across multiple chassis to end and new jobs attempting to start across multiple chassis to fail.

9 April 2007 /home file server

    OS update for the /home file server could not be completed in parallel with the network switch work. Cluster remained unavailable until about 10am while the /home file server was updated to the same Linux version as the cluster login and compute nodes. We regret this extended down time - but feel it was important to get the file server OS updated.

    As a side effect of the OS update, quota information for /home was lost. Quotas will be reset - at a value higher than current use - but not necessarily at the same level as previously.

9 April 2007 Linux clusters down

    From 6-8AM on Monday April 9th the cluster will be down to update the core ethernet switch.

    It is likely that any jobs running at that time will be lost since network connections to storage and between chassis will be disconnected.

    To minimize lost work, queues will be paused Sunday evening (April 8th) to allow as many jobs as possible to complete prior to the network work Monday morning.

21 February 2007 /share

    /share file system is again available read/write from all henry2 nodes. The file server for /share has been replaced. Also the mount options for this file system are now identical to /share3. MPI-IO jobs should use /gpfs_share. MPI-IO will no longer work reliably on /share - however, performance of /share should now be as good as or better than /share3.

21 February 2007 henry2 login nodes

    henry2 login nodes (login01 and login02) will be replaced with newer servers (current servers are no longer under maintenance).

    The transition will happen in two steps. The first step occurred on Monday 19 February. login01 was replaced by a newer server running a more recent version of Linux. Please report any issues observed using the new login01.

    Once all issues are resolved with the new login01 node - login02 will also be replaced.

2 February 2007 /home file system

    Responsiveness of /home file system deteriorated overnight until, this morning, logins became completely impossible. Server for /home was rebooted along with both login nodes (which appeared to be the source of the /home problem).

29 January 2007 HPC core network switch

    The core HPC network switch will be rebooted at 8am on Monday January 29. The reboot will take about 5 minutes. During the reboot connections to HPC cluster login nodes, from HPC nodes to HPC storage, and for interchassis MPI communications will be unavailable.
    HPC jobs attempting to access HPC storage or communicate between chassis during this time will fail.
    To reduce the impact of the reboot, cluster queues will not start new jobs after about noon on Sunday 28 January.
    We regret this inconvenience, but the reboot is necessary to apply a security update on the switch. Monday morning was chosen as the time for the upgrade because that is typically the time when the fewest jobs are running on the cluster.

23-December-2006 /gpfs_share

    Following the data center power failure, one of four disk arrays used by /gpfs_share failed to recover.
    /gpfs_share was fully operational again about 2pm Dec 24.

20-November-2006 /ncsu/volume1

    There are more than 250,000 migrated files in /ncsu/volume1 file system. After applying a fix provided by the file system vendor all but about 14 files have been repaired - all of these belong to a single user.

    /ncsu/volume1 is again available read/write from HPC login nodes.

    We very much regret this extensive period of being unable to write to this important file system. We are working with the vendor to develop procedures to minimize the chance of future disruptions.

4-October-2006 /ncsu/volume1

    Migrated files on /ncsu/volume1 were not being recalled on demand. File system is available read only from login03.hpc.ncsu.edu while we work with the vendor to correct the problem with migrated files.

9-September-2006 /ncsu/volume1

    A file system error occurred on /ncsu/volume1. The file system was taken offline and repaired. File system was off line from about 2pm Friday until about 7:30am Saturday.

24-July-2006 /share and /share3 on henry2

    File server for /share crashed about 3am and was returned to service about 8am.

    Several compute nodes remained in a busy state trying to access /share3. About 9:30am the server for /share3 was rebooted to free the hanging compute nodes.

24-June-2006 henry2 Linux Cluster

    Following a power outage Saturday afternoon cooling for the HPC henry2 Linux Cluster was lost. Cluster compute nodes were powered down to minimize damage from the resulting high temperatures. LSF jobs running on henry2 at 4pm Saturday were lost.

    Please carefully check results from Saturday jobs to be sure they completed correctly.

    Also as a result of the loss of cooling the server for /share3 failed. /share3 was returned to service about 3pm Sunday 25 June.

11-May-2006 /ncsu/volume1

    Server for /ncsu/volume1 has again become unstable.

    Update: 17-May Some file system and NFS settings have been adjusted. File system is currently mounted from login01 only.

25-Apr-2006 /ncsu/volume1

    /ncsu/volume1 is unavailable. Server for this file system crashed.
    Update: 5-May Server hardware has been replaced and software reloaded. File system is currently reconciling with the HSM database. File system is available again as of about 3pm 5 May.

1-Mar-2006 /ncsu/volume1

    /ncsu/volume1 is now available from new server and disk space for reading and writing. This file system is now managed by a hierarchical storage manager that will migrate old, large files to tape. Any access of migrated files will restore them to disk - with some delay as the tape is loaded and read.

15-Feb-2006 Storage News

    Two storage enhancements are underway on HPC systems.

    Henry2 GPFS - A GPFS (general parallel file system) instance is being deployed on the henry2 Linux cluster. This is the same type of file system that is used for shared scratch space on the POWER5 system. The cluster implementation currently uses two servers and has about 2TB capacity.

    Testing so far has shown about 3X better performance than the best performance seen with NFS shared file systems (eg /share3).

    If testing continues to go well, disk resources currently allocated to /test_share will be redeployed to gpfs along with another TB of disk to provide 6TB of gpfs space this spring. Eventually we expect that /share and /share3 disks will also be reallocated to gpfs to provide 8TB of gpfs space. Target for this transition is during Spring 2006 exams.

    Group quotas will be enforced on the GPFS file system. Currently the group quota is 1TB. Also, like other shared scratch file systems the gpfs space will not be backed up and will be subject to a periodic purge to maintain free space.

    Mass Storage - Mass storage volume1 is in the process of being migrated to a new server. After migration this file system will be managed with Tivoli Space Manager. This will allow large files which have not been recently accessed to be stored on tape rather than disk, thereby making additional storage space available in the /ncsu/volume1 file system. Actual additional space will depend on compression ratio achieved in storing files to tape, but it is estimated that the current tape library capacity will provide an additional 10TB of mass storage space.

    This will increase mass storage space to approximately 26TB from the current 16TB and is expected to be in operation by the end of February.

6-Jan-2006 Major Maintenance Window

    On Friday January 6 a major maintenance window will be taken to make significant adjustments to HPC systems:
    • Two additional p575 nodes will be added to the power5 system
    • Network switches for power5 system will be upgraded
    • Network switches for mass storage system will be relocated
    Due to these changes the Power5 system will not be available between 6am Friday and 6pm Friday.
    Also, the mass storage directories (/ncsu/volume1 and /ncsu/volume2) will not be available from the HPC Linux clusters (henry2 and tim) from 6am Friday until 6pm Friday.

28-Dec-2005 /ncsu/volume[12] read-only

    /ncsu/volume1 and /ncsu/volume2 will be read-only from Wed Dec 28 through approximately Sat Dec 31 to allow the migration of /ncsu/volume1 to a slightly larger file system.

5-Dec-2005 henry2 network outage

    There will be a brief network outage for the henry2 cluster Monday Dec 5 about 7:30am. This outage is to allow the switch serving the henry2 cluster to be upgraded. Outage is expected to last about 10 minutes.

13-Nov-2005 /share file system on henry2

    The /share file system will be unavailable for 10-15 minutes between 8pm and 8:30pm on Sunday evening. This down time is needed to allow for maintenance on the disk array serving /share.

    LSF jobs running from /share could abort when /share is taken off line. Users planning to run jobs over the weekend may want to run from /share3 instead of /share.

7-Nov-2005 Power5 system

    Power5 system network connections to internal HPC network are down. This is resulting in /home, /usr/local/apps, /ncsu/volume1, and /ncsu/volume2 being unavailable. Comtech has been notified of the problem.

6-Nov-2005 henry2 cluster

    File server for /share file system had hung and had to be rebooted. LSF jobs running from /share were lost.

3-Nov-2005 henry2 cluster and power5 system

    File server for /home file system had to be rebooted to clear lots of hanging processes on login nodes.

22-Oct-2005 henry2 cluster

    File system on management node holding LSF filled overnight. This caused LSF to stop.

    LSF was moved to a new, larger file system.

    Jobs submitted after the old file system filled would have been lost.

    Please send email if any problems or unusual behavior are observed with LSF on the cluster.

06-Oct-2005 /ncsu/volume[12] mass storage

    Mass storage file systems are again available from HPC login nodes.

    During the next couple months the mass storage file systems will be migrating from a single server to a server for each file system. During this transition there will be some periods of time that the file systems will be read-only. These read-only periods will be announced in News, Sysnews, and login banners.

    Once multiple servers are in place the mass storage system will be much less likely to be offline from a single component failure.

05-Oct-2005 /ncsu/volume[12] mass storage

    Server for /ncsu/volume[12] is not responding.
    Server has had a hardware failure, it is being repaired.

03-Oct-2005 LSF licenses

    There are ongoing issues with LSF licenses on the power5 system.

    At this time the power5 system is being returned to friendly user mode due to issues with batch processing.

    We are working with the LSF vendor to resolve these issues as quickly as possible.

    We very much regret the inconvenience this license problem is causing power5 users.

23-Sept-2005 LSF licenses

    Renewal LSF licenses were installed yesterday. This morning there were problems with LSF having the correct licenses for scheduling parallel jobs.

    The version of LSF running on the henry2 Linux cluster was updated from 5.1 to 6.1. Users should log off and back on the cluster before submitting jobs to ensure that their environment is correctly configured for the new LSF version.

19-Sept-2005 power5 system

    As of Monday September 19 the power5 system is in production operation.

    The system supports large memory (up to 32GB of physical memory) jobs using up to 8 processors.

    Fortran, C, and C++ compilers are available to build user applications. Both MPI and OpenMP parallelization are supported on the system.

    For more information regarding the power5 system see the power5 "How to" page: http://hpc.ncsu.edu/Documents/SharedMemory/GettingStartedp5.php

12-Sept-2005 login.hpc.ncsu.edu

    The load balancer for login.hpc.ncsu.edu will be changed at midnight Monday 12 September. Open sessions will be dropped. However, new ssh sessions should be immediately available through the new load balancer.

31-Aug-2005 IBM p575

    The power5 system will be unavailable on Thursday September 1. The system is being relocated in the data center following the removal of the p690 system.

    It is expected that the power5 system will be available again late Thursday afternoon.

1-Aug-2005 IBM p575

  • The new shared memory system has two IBM p575 compute nodes each with 8 1.9GHz single-core POWER5 processors and 32GB of memory; an IBM p550 login node with two 1.65GHz dual-core POWER5 processors and 8 GB memory; and 2TB shared scratch space available to all three nodes using IBM's general parallel file system (gpfs). All shared file systems are also available.

    Please note that the new system and the Henry2 cluster share the /home directory (with p5 home being in /home//p5, and Henry2 continuing to be in the /home/ directory).

    Also, please note that your userid on p5 is now accessible using your unity password.

7-July-2005 henry2 software updates

  • Totalview debugger - has been updated to version 7.0.0-1
    License permits debugging of parallel jobs using up to 4 processors
  • CFX5 - has been updated to release 5.7.1
    Also the renewed license includes 16 parallel processing licenses - please limit use to no more than 8 tasks for a single job.

1-July-2005 cluster /home file system

    Since the operating system update on the henry2 cluster there have been a number of software failures on the server for the /home file system. Efforts to identify the cause of these failures have not been successful. On June 30 the server experienced three failures. Following the second June 30 failure a new server was configured and migration of /home to the new server began. Following the third June 30 failure the migration to the new server was completed.

    Jobs using files on /home may have encountered problems June 30 due to the number of failures and extended period of the third outage as the transfer to the new server was completed.

    The new server has twice as much physical memory as the previous server and is running the same Linux distribution and kernel as the cluster nodes (whereas before the server was running a different kernel).

    We will continue to closely monitor the server for /home and regret the inconvenience the previous failures have caused.

1-July-2005 IBM p690

    Production use of IBM p690 ended June 30, 2005. A replacement system based on POWER5 processors has been delivered and is expected to be installed within the next few days. During the transition to the new system the p690 will remain available; however, it is no longer under maintenance, so any hardware failures may not be repaired.

    Timeline for friendly user access to the new system will be posted once installation is complete.

    Output from any jobs run on the p690 during this transition period should be copied off as soon as possible - keeping in mind that the system is no longer under maintenance.

6-June-2005 IBM p690 Replacement

    The IBM p690 which NC State has operated for the past two years will be replaced with a new shared memory computing system. The new system will be installed in mid-June and it is planned to retire the p690 at the end of July.

    NC State has been paying the annual hardware maintenance costs for the p690. The replacement system has been acquired for approximately the amount that would have been spent renewing the p690 maintenance for another year.

    Existing p690 hardware maintenance expires at the end of June. While it is planned to continue operating the p690 through July, any hardware failure during this time would likely not be repaired. Users should be careful to get data off the p690 promptly when runs complete.

    The new shared memory system will have two IBM p575 compute nodes each with 8 1.9GHz single-core POWER5 processors and 32GB of memory; an IBM p550 login node with two 1.65GHz dual-core POWER5 processors and 8 GB memory; and 2TB shared scratch space available to all three nodes using IBM's general parallel file system (gpfs).

2-June-2005 henry2 /home file system

    The /home file system for henry2 cluster is currently off line. This is causing login attempts to hang.

    Working with server to identify cause of this recurring issue. /home should be back online by 9am.

24-May-2005 henry2 /home file system

    The /home file system for henry2 cluster is currently off line. This is causing login attempts to hang.

    Server for /home rebooted and normal operation has been restored.

12-May-2005 Intel Compilers - Henry2 Linux Cluster

    The default Intel compiler version has been updated from 7.1 to 8.1
    This affects the compiler version obtained using the 'add intel' command.
    The 8.1 Intel compilers are invoked with different commands than 7.1 - Fortran is ifort and C++ is icpc

    7.1 compilers remain available. Use command

    source /usr/local/intel/compiler70/ia32/bin/ifcvars.csh
    to access 7.1 instead of 8.1
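
    A build script that has to work with either compiler version could probe for the Fortran command name, as in the minimal sketch below. The name ifort is the 8.1 command given above; ifc as the 7.1 Fortran command is an assumption (suggested by the ifcvars.csh script name) and should be checked against the installed compiler. The function name pick_intel_fortran is hypothetical, and the sketch uses a Bourne-style shell rather than csh.

```shell
# Hypothetical helper: report which Intel Fortran command is on PATH.
# "ifort" is the 8.1 name from the notice above; "ifc" for 7.1 is an
# assumption and should be verified on the system in question.
pick_intel_fortran() {
    for fc in ifort ifc; do
        if command -v "$fc" >/dev/null 2>&1; then
            echo "$fc"
            return 0
        fi
    done
    echo "no Intel Fortran compiler found on PATH" >&2
    return 1
}

# Example use in a build script: report "none" if neither command exists.
FC=$(pick_intel_fortran) || FC=""
echo "Fortran compiler: ${FC:-none}"
```

    The 8.1 name is tried first, so the result follows the default set by 'add intel' whenever both versions are on the PATH.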

8-May-2005 Henry2 Linux Cluster

    Henry2 Linux cluster login nodes were not responding. File server for home file system was down. The server was upgraded and returned to service. Login nodes available as of 10:30am.

26-Mar-2005 IBM p690 (mcrae)

    IBM p690 (mcrae) went down about 12:30 Saturday afternoon. System was rebooted and returned to service around 4:30pm. About 9:30pm the system crashed again.
    Access to the p690 was restored about 4pm on Monday (28 March). LSF jobs that were running at the times the system crashed were lost. Jobs waiting in LSF queue were not affected.

8-Mar-2005 /ncsu/volume2

    Move of SMS completed and production version of /ncsu/volume2 is again available read/write for users with space on that file system.

    HPC very much regrets the short notice that was provided for this service outage.

7-Mar-2005 /ncsu/volume2

    Half of the university storage management system (SMS) is being relocated. While this part of the SMS is offline, /ncsu/volume2 will be available read only from the backup version.

    Users may find some files in the backup that they previously deleted.

    It is expected that the SMS will be back online by Wednesday March 9.

3-Mar-2005 p690 (mcrae) LSF ok

    A combination of network link updates and license changes to/at MCNC has caused instability in both LSF and mpiexec. All issues appear to have been resolved.

24-Feb-2005 p690 (mcrae) LSF Down

    IBM p690 (mcrae) has lost connection to LSF license server. LSF is down. Running and queued jobs should not be affected. New jobs are not being accepted and no new jobs are being started. Working to identify cause for loss of connection. Basic LSF operation was restored before 5pm.

18-Feb-2005 /ncsu/volume2 Unavailable

    From about 9pm Friday 18-Feb /ncsu/volume2 will be unavailable due to maintenance on the university storage management system. /ncsu/volume2 will be back online by 8am Saturday 19-Feb.

16-Feb-2005 /ncsu/volume2 and /ncsu/volume1 Unavailable from Clusters

    /ncsu/volume1 and /ncsu/volume2 have been intermittently unavailable from the cluster today. Access from the p690 has not been affected. Problem was resolved in network by late afternoon.

12-Feb-2005 /ncsu/volume2 Unavailable

    From about 9pm Friday 11-Feb /ncsu/volume2 was unavailable due to maintenance on the university storage management system. /ncsu/volume2 was back online before 8am Saturday 12-Feb.

2-Feb-2005 /share Unavailable

    Thursday Feb 3, /share file system will be unavailable briefly about 8am. The server for /share file system will be rebooted to bring online additional storage.

19-Jan-2005 Clusters find a new home

    HPC Xeon cluster (henry2) and the Opteron cluster (Tim) have moved to the new Computer Disaster Relief machine room... Cooool space!!!


16-Jan-2005 Cluster Move Update

    HPC Xeon cluster (henry2) was returned to service around 6pm on Sunday Jan 16. LSF queues were restarted a few hours earlier.

    Only one login node is currently available for henry2 cluster. Second login node should be available again Tuesday.

    Opteron test cluster is expected to be available for use again by end of day Tuesday Jan 18.

13-Jan-2005 /ncsu/volume[12] File Systems

    Users will have access to /ncsu/volume[12] during the cluster move starting Friday January 14 (but will need a mcrae account).

28-Dec-2004 University Linux Clusters Moving

    The university Linux Clusters (henry2 and tim) will be moved to the new data center. Currently it is expected the move will begin Friday January 14 and be completed by Tuesday January 18.

    In preparation for the move LSF queues will stop starting new jobs around noon of Thursday January 13 (except for debug queue). Queued jobs that have not started should requeue after the move without problems. Jobs running when the cluster goes down for the move will likely be lost.

    Data stored on /ncsu/volume[12] will continue to be available from mcrae.hpc.ncsu.edu.

15-Dec-2004 Opteron Test Cluster

    A small (4 compute node + 1 interactive node) AMD Opteron cluster (tim) is now available for testing by friendly users. The cluster uses IBM e325 dual Opteron servers with 2GHz processors. The compute nodes each have 9 GB of memory.

    Opterons will run x86 binaries and the Portland Group x86-64 compilers are available to develop 64-bit executables.

    Contact eric_sills@ncsu.edu if interested in being a friendly user of the Opteron cluster.

10-Dec-2004 henry2 reboot

    The cluster head node will be rebooted Friday morning to attempt to clear some issues being observed with file systems. During the reboot access to /home and /usr/local file systems will be lost.

25-Oct-2004 henry2 login nodes

    Login sessions via ssh to the henry2 cluster should be to login.hpc.ncsu.edu

    This will direct the login session to one of currently two login nodes. This avoids a potential single point of failure for the cluster and also permits easy expansion of login nodes if needed to support future use.

    Access to henry2.hpc.ncsu.edu has been restricted.

19-Oct-2004 Mass Storage Offline

    /ncsu/volume1 and /ncsu/volume2 became unreachable from henry2 around 3:30pm. By 4:30pm these file systems were also unreachable from mcrae. File server was rebooted and connectivity restored around 5:30pm.

18-Oct-2004 Perros appointed to NLR Network Research Council

    Dr. Harry Perros (NC State, Computer Science) has been appointed to the National Lambda Rail (www.nlr.org) Network Research Council (NLR NRC).

    A significant portion of the NLR facilities is to be devoted to research in networking. The NLR NRC will both provide guidance to the Board of NLR and inform the networking community about this opportunity. The NLR NRC will also provide input on the critical research issues that can utilize the advanced capabilities of the NLR network.

    Members are
    
    Paul Barford, University of Wisconsin-Madison
    Dan Blumenthal, University of California, Santa Barbara
    Javad Boroumand, Cisco Systems
    Hank Dardy, Naval Research Laboratory
    Constantinos Dovrolis, Georgia Tech
    David Farber, Carnegie Mellon University (chair)
    Gerald Faulhaber, University of Pennsylvania
    Paul Francis, Cornell University
    Larry Landweber, University of Wisconsin-Madison and Internet2 (ex officio)
    Jason Leigh, University of Illinois-Chicago
    Steven Low, Caltech
    Mike O'Dell, unaffiliated
    Phil Papadopoulos, University of California, San Diego
    Craig Partridge, BBN Technologies
    Guru Parulkar, National Science Foundation
    Harry Perros, North Carolina State University
    

14-Oct-2004 Thom Dunning to lead NCSA

12-Oct-2004 ORNL Positions

11-Oct-2004 - mcrae (IBM p690) response slow

    On Thursday 30 Sept response from IBM p690 (mcrae) became very slow and system was rebooted. Unfortunately, LSF jobs were lost during the reboot.

    Following the reboot, LSF MPI jobs failed with either license or PJL errors, until configuration changes were made on Monday 3 October.

    p690 again began to display very slow response on Saturday 9 October. System was rebooted Sunday 10 October. All running LSF jobs had completed prior to the reboot.

29-Sep-2004 - henry2 again available

  • ssh connectivity to henry2 was lost from approximately 8pm Friday 24 Sept until 10:30am Saturday 25 Sept.
  • ssh connectivity was again lost from approximately 12:30-1:30pm on Monday 27 Sept.
  • ssh connectivity was again lost from approximately 11pm-midnight on Wednesday 29 Sept.
    LSF jobs continued to run on compute nodes. Jobs using /home during the above times should be examined closely for any problems.

23-Sep-2004 - Parallel Jobs on Henry2 Cluster

    LSF job scripts that explicitly invoke 'pam' should revert to using the 'mpiexec' command to execute MPICH jobs.

    Due to upcoming LSF license expiration and renewal, there may be periods during which pam will fail with an error message saying 'Node not licensed'. The 'mpiexec' command will be altered as needed during the license transition to utilize the best MPICH execution mechanism available.
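
    As a rough illustration of the change described above, a job script might look like the sketch below. The processor count, time limit, output file names, and the program name my_mpi_program are placeholders chosen for this example, not site settings, and any arguments that the site's 'mpiexec' wrapper accepts are not shown because this notice does not describe them.

```shell
#!/bin/bash
# Hypothetical LSF job script; all values below are illustrative only.
# Number of processors requested:
#BSUB -n 4
# Wall-clock limit in minutes:
#BSUB -W 30
# Standard output and error files (%J expands to the LSF job ID):
#BSUB -o out.%J
#BSUB -e err.%J

# Launch the MPICH program through mpiexec instead of calling pam directly.
# my_mpi_program is a placeholder for your own executable.
mpiexec ./my_mpi_program
```

    A script like this would typically be submitted with 'bsub < jobscript', where jobscript is whatever name the file was saved under.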

6-Sep-2004 - IBM Announces "Open Architecture" for Blades

6-Sep-2004 - NC State Virtual Laboratory

  • Virtual Computing Laboratory is here [more ...]

6-Sep-2004 - Conferences

  • ITD Booth at EdTech - come and visit us [more ...]
  • UNC CAUSE - HPC will present [more ...]
  • Supercomputing 2004 [more ...]

6-Sep-2004 - Processors added to cluster

    The Henry2 cluster now has a total of 208 processors.

6-Sep-2004 - HPC and Grid Courses

11-August-2004 - henry2

    henry2 was not accepting logins - and has been rebooted.

    ALL JOBS MUST BE RUN THROUGH LSF. JOBS RUNNING ON HENRY2 WILL BE KILLED WITHOUT WARNING


Old News (7/1/03-7/31/04)
 

Copyright © 2003-2007 by NC State University and others, All Rights Reserved.

Site contact: Eric Sills, E-mail: eric_sills at ncsu dot edu, Tel: 919-513-0324, Fax: 919-513-1893, HPC and Grid Operations, Information Technology Division, Box 7109, North Carolina State University, Raleigh, NC 27695-7914, USA