Admin Alert: Keeping i5/OS Ethernet Lines Connected
November 4, 2009 Joe Hertvik
Many standard configuration i5/OS Ethernet lines have a common problem. If the line loses connectivity for any reason, the controllers and devices attached to that line will go into a recovery pending state (RCYPND), which requires manual intervention to restart communication. I ran into this problem a few weeks ago and found two solutions for resolving the issue and automatically recovering communications after an outage.
What Happened To My Communications Line?
I experienced this issue when my network services people took down our core network switch for maintenance one Sunday morning, killing Ethernet and TCP/IP connectivity for all network users. After several minutes without communications, the controllers and devices connected to our System i Ethernet lines went into recovery pending state (RCYPND), and they were unable to communicate with the outside world. To restore connectivity, we had to reset all the Ethernet lines after the router came back.
The situation baffled us. Why didn’t TCP/IP remain active over our comm lines? Other network servers, including Microsoft Exchange, Windows file servers, and some Novell servers, started communicating again when the router was restarted. Why didn’t our System i partitions do the same?
The answer lay in a little-known i5/OS configuration parameter for Ethernet lines called Communication Recovery Limits (CMNRCYLMT). You can view this parameter by running the following Display Line Description (DSPLIND) command:
DSPLIND LIND(ETHERLINE) OPTION(*TMRRTY)
This command will show you a display that looks something like this:
Display Line Description                         10/26/09 20:23:24
Line description . . . . . . . . . : ETHERLINE
Option . . . . . . . . . . . . . . : *TMRRTY
Category of line . . . . . . . . . : *ELAN
Recovery limits:
  Count limit  . . . . . . . . . . : 2
  Time interval  . . . . . . . . . : 5
These parameters specify second level recovery limits for your line. First level limits deal with the line’s inactivity timer; second level limits specify what happens when the connection is entirely lost. The Count Limit parameter specifies how many attempts the line will make to recover connectivity during the number of minutes listed in the Time Interval parameter. The default values for new Ethernet lines are two tries for the count limit and five minutes for the time interval. When a communications failure occurs, your Ethernet line will attempt to re-establish connectivity twice within a five-minute span before going into recovery pending state (RCYPND). The only way to emerge from RCYPND is to manually reset the target line, controller, and device descriptions.
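For reference, a manual reset of this kind is usually done with the Vary Configuration (VRYCFG) command. The following is a minimal sketch, assuming the line is named ETHERLINE as in the display above and that you run it from a session that does not depend on the failed line, such as the system console. Whether your TCP/IP interface also needs to be restarted afterward depends on your configuration.

```cl
/* Check the status of the line and its attached controllers and devices */
WRKCFGSTS CFGTYPE(*LIN) CFGD(ETHERLINE)

/* Vary off the line along with its attached controllers and devices */
VRYCFG CFGOBJ(ETHERLINE) CFGTYPE(*LIN) STATUS(*OFF) RANGE(*NET)

/* Vary the line and its attached objects back on */
VRYCFG CFGOBJ(ETHERLINE) CFGTYPE(*LIN) STATUS(*ON) RANGE(*NET)
```

RANGE(*NET) tells VRYCFG to act on the line and everything attached to it, which is why one command per direction is enough here.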
It’s important to note that lines configured this way will fail whenever Ethernet communication is cut off for more than five minutes. I don’t know what IBM was thinking when it set up its standard Ethernet configuration this way, but it doesn’t work well during a network outage.
How To Fix the Issue
To solve this problem, we ran the following Change Line Desc (Ethernet) (CHGLINETH) command from a 5250 green-screen:
CHGLINETH LIND(ETHERLINE) CMNRCYLMT(15 1)
CMNRCYLMT changes take effect immediately, so we didn’t have to restart the line to enact the change. After running this command, we viewed our new recovery limits by once again running this DSPLIND command:
DSPLIND LIND(ETHERLINE) OPTION(*TMRRTY)
By entering *TMRRTY in the Option (OPTION) parameter, we could view the Ethernet line’s recovery limit values on this display:
Display Line Description                         10/26/09 20:59:37
Line description . . . . . . . . . : ETHERLINE
Option . . . . . . . . . . . . . . : *TMRRTY
Category of line . . . . . . . . . : *ELAN
Recovery limits:
  Count limit  . . . . . . . . . . : 15
  Time interval  . . . . . . . . . : 1
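If a partition has more than one Ethernet line, the same change has to be made on each line description. As a rough sketch, assuming two hypothetical line names (ETHLINE01 and ETHLINE02, which are not from our systems), you could list the line descriptions first and then change each Ethernet line in turn:

```cl
/* List all line descriptions; the Type column should identify */
/* which ones are Ethernet lines                               */
WRKLIND LIND(*ALL)

/* Apply the same recovery limits to each Ethernet line.       */
/* ETHLINE01 and ETHLINE02 are hypothetical names; substitute  */
/* the names shown on your own system.                         */
CHGLINETH LIND(ETHLINE01) CMNRCYLMT(15 1)
CHGLINETH LIND(ETHLINE02) CMNRCYLMT(15 1)
```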
Two weeks after the change, our network services people took down the core router again to complete their maintenance. This time our Ethernet line waited and re-established its connection without manual intervention after the router came back up.
So What Do These Settings Do?
CMNRCYLMT controls your Ethernet line’s second-level retry behavior, and these values effectively keep the line making recovery attempts until it can re-establish connectivity. By changing the count limit to 15, we told i5/OS to allow 15 recovery attempts before giving up. By setting the time interval to one minute, however, we made it impossible for the system to use up all 15 attempts within the interval. When the line can’t complete 15 recovery attempts in a minute, the limit is never reached, and it tries again during the next minute. When that fails, it tries again in the next minute, and so on until it successfully re-establishes connectivity. We made this change on an IBM recommendation, and it worked great the first time we needed it.
But If That Doesn’t Work For You. . .
Interestingly enough, this isn’t the only way to prevent Ethernet lines from going into RCYPND state. There is another way to set your CMNRCYLMT parameters so that your line will keep trying to restore connectivity indefinitely. Be warned that I haven’t tested this configuration yet, so I can’t guarantee that it will work. However, this configuration is documented on several Internet sites, so if the first technique doesn’t work, try this one to see if it does the trick.
Run the Change Line Desc (Ethernet) (CHGLINETH) command as follows:
CHGLINETH LIND(ETHERLINE) CMNRCYLMT(5 0)
Again, CMNRCYLMT changes will immediately take effect. When you run the DSPLIND command, your parameters will now look like this:
Display Line Description                         10/26/09 20:59:37
Line description . . . . . . . . . : ETHERLINE
Option . . . . . . . . . . . . . . : *TMRRTY
Category of line . . . . . . . . . : *ELAN
Recovery limits:
  Count limit  . . . . . . . . . . : 5
  Time interval  . . . . . . . . . : 0
When CMNRCYLMT is set up with a zero in the time interval parameter and any nonzero number in the count limit parameter, the zero value specifies an infinite retry and recovery interval for your line. (A count limit of zero, by contrast, turns recovery attempts off.) This configuration should also allow your lines to reconnect after a network outage or when someone does something stupid, like unplugging your Ethernet cable. Again, I haven’t deployed this configuration yet, but there’s no reason to think that it won’t work. Test before deployment to ensure you won’t have any problems.
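One way to structure such a test, sketched here under the assumption that your line is named ETHERLINE and currently has the default limits of 2 and 5 described earlier, is to print the current settings for your records, apply the change, and keep the rollback command handy:

```cl
/* Print the current recovery limits to a spooled file */
/* so you have a record of the starting values         */
DSPLIND LIND(ETHERLINE) OPTION(*TMRRTY) OUTPUT(*PRINT)

/* Apply the infinite-retry configuration */
CHGLINETH LIND(ETHERLINE) CMNRCYLMT(5 0)

/* Confirm the new values */
DSPLIND LIND(ETHERLINE) OPTION(*TMRRTY)

/* Rollback: restore the default limits if the test misbehaves. */
/* Use the values from your own printout if they differ.        */
CHGLINETH LIND(ETHERLINE) CMNRCYLMT(2 5)
```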
About Our Testing Environment
Configurations in this article were tested on several System i partitions running i5/OS V5R4. These parameters can also be found on pre-V5R4 i5/OS versions, and they are also available in the i 6.1 operating system. Due to time and machine constraints, we did not test the second recovery limit example (where you set CMNRCYLMT to 5 and 0), but I believe that technique should work in practice.