Only 8 years into running this product and I still learn something new about it.

Monday there was an event. Two nodes became unresponsive at about the same time. The other ten nodes did their jobs and transferred session information to the nodes taking on the sessions. Most were so busy they did not respond to monitor requests. There was a lot of slowness. But we did not lose sessions. Nor did we lose the cluster.

Somehow we did lose the Mail tool. (Think internal email, but it can forward messages to email.)

In WebCT Vista 3 we diagnosed this by going to Weblogic, finding the email queues, and restarting some things so email would start flowing again. I was not able to find it that way this time. Apparently now we go to Background Jobs as a server administrator. The waiting mail jobs show up in the Pending Jobs view.

Once I restarted the cluster, the blocking mail job was changed to Retried as soon as the JMS node came online. Retried only shows up in the All Jobs view. None of the other views show it. Which makes sense, because each view only shows jobs whose status matches the view's name. So the Cancelled Jobs view only shows jobs with the Cancelled status. Any jobs with a Retried status should only show in the (non-existent) Retried Jobs and (existing) All Jobs views. It was a bad assumption on my part that every potential status has a view.

Hindsight being 20/20, what we need is a Nagios monitor to detect when Pending jobs exceed maybe 20-50 jobs. Normally this table appears empty. But I could see cases where it legitimately grows fast and then quickly clears.
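
A rough sketch of what such a check might look like, as a Nagios-style plugin in Python. The thresholds (warn above 20, critical above 50) just come from my guess above. How the pending count actually gets pulled out of Vista is left open: here it is simply passed in as an argument, and the names `evaluate_pending` and `main` are made up for illustration. The only thing borrowed from Nagios is its standard plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
#!/usr/bin/env python3
"""Sketch of a Nagios-style check on the number of Pending background jobs."""

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate_pending(count, warn=20, crit=50):
    """Map a pending-job count to (exit_code, status_line).

    warn and crit are assumed thresholds, not values from the product.
    """
    if count is None or count < 0:
        return UNKNOWN, "UNKNOWN - could not read pending job count"
    # Perfdata after the pipe lets Nagios graph the value over time.
    perfdata = f"pending={count};{warn};{crit}"
    if count > crit:
        return CRITICAL, f"CRITICAL - {count} pending jobs | {perfdata}"
    if count > warn:
        return WARNING, f"WARNING - {count} pending jobs | {perfdata}"
    return OK, f"OK - {count} pending jobs | {perfdata}"


def main(args):
    """args[0] is the pending count, however it was obtained (assumption)."""
    try:
        count = int(args[0])
    except (IndexError, ValueError):
        count = None
    code, message = evaluate_pending(count)
    print(message)
    return code


# To run as a real plugin, finish the script with:
#   import sys; sys.exit(main(sys.argv[1:]))
```

Since the table can grow fast and then clear on its own, it would probably make sense to have Nagios retry a few times before alerting (the `max_check_attempts` setting on the service) rather than paging on the first spike.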

But then again, we have less than a year left on this product. What are the odds this will happen again?