Production Readiness

 

IT Ultra’s Officer and Director level IP Professionals have developed Siebel verticals. Production readiness was a necessity in developing these verticals: no product was released until it passed Quality Assurance for functional correctness, performance, and scalability. From a product standpoint, Siebel ensures that the product performs and scales out of the box. Issues arise when the product is customized and configured. Every customization and configuration needs to be tested for functionality, performance, and scalability. This is where IT Ultra can help.

 

After running Siebel verticals, our members moved to the performance and scalability team, where we helped design the Siebel High Interactivity (HI) client, the network protocol, the scalability test scenarios, and the internal testing infrastructure for Siebel.

 

We apply these techniques to implementations of Siebel to ensure the same level of stability as the application moves from development to production.  This is often referred to as production readiness.

 

Below are a number of our customer experiences.

Deutsche Telekom is Siebel’s largest implementation. The test plan was to scale to over 25,000 concurrent call center agents. We assisted IBM and Oracle / Siebel in refining the test and QA strategy. We were then asked to help with performance problems: query workflows were running for over 5 minutes, which left the application in an unacceptable state. Oracle Database experts and Siebel Expert Services were having difficulty determining the problem. IT Ultra’s resources identified the problem within days, and performance began improving in a matter of days. Within 1 month, most of the slow queries had been brought down to less than 1 minute.

We accomplished much of this work off-site, using remote access and daily reporting. Our methodology was to analyze system performance across successive LoadRunner runs, looking at the database, SARM, and network reports. IBM at Deutsche Telekom now uses scripts similar to the ones we wrote to continue analyzing the production environment. Over 2 years later, the worst performing query in production takes 37 seconds, with over 16,000 concurrent call center agents and more than 200 million accounts.
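The run-over-run analysis described above can be illustrated with a minimal sketch. It is not the script used on the engagement; the function name, the query names, and the timings are all hypothetical, and the only figure taken from the text is the 1 minute target used as the default threshold. It averages each query's time across several load-test runs and lists the ones still above the threshold, slowest first.

```python
from statistics import mean

def slowest_queries(runs, threshold_s=60.0):
    """Return (query, mean_seconds) pairs whose mean time across all runs
    exceeds threshold_s, sorted slowest first."""
    timings = {}
    for run in runs:
        for name, seconds in run.items():
            timings.setdefault(name, []).append(seconds)
    flagged = [(name, mean(times)) for name, times in timings.items()
               if mean(times) > threshold_s]
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)

# Hypothetical per-query timings (seconds) from two load-test runs.
runs = [
    {"account_lookup": 310.0, "order_history": 42.0},
    {"account_lookup": 290.0, "order_history": 38.0},
]
print(slowest_queries(runs))
```

Averaging across runs, rather than reading a single run, keeps one noisy measurement from hiding or exaggerating a slow query.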

In addition to the performance effort, we assisted and guided IBM in developing an automated, federated testing infrastructure. The IBM / Deutsche Telekom testing strategy had great coverage; however, it was not automated. They used a number of LoadRunner scripts, but setting up the environment and running the tests was extremely manual. We worked with IBM to design the automated, federated testing environment. It consisted of a central orchestrator / coordinator server that decided which processes needed to be in place and requested that a remote machine run each process. A sample run would be:

  • Install / reset a database --> this needs to run on an AIX server
  • Install Siebel environment --> multiple AIX machines
  • Run sanity check --> Windows LoadRunner
  • Run load test --> multiple Windows machines were needed to bring load to 20,000 concurrent users
  • Gather results and analyze --> AIX boxes for database and Siebel Servers, Windows for LoadRunner

The system needed to coordinate between multiple operating systems and functions.  IBM is now productizing the system.
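The coordination problem can be sketched as follows. This is only an illustration of the idea, not the system IBM built: the `Step` class, the `plan_run` function, and the host names are hypothetical, and the sketch only plans which machines each step would run on rather than executing anything remotely. The final "gather results" step, which spans both platforms, is left out for brevity.

```python
from dataclasses import dataclass

@dataclass
class Step:
    name: str
    platform: str          # "aix" or "windows"
    hosts_needed: int = 1

def plan_run(steps, pool):
    """Assign each step, in order, to hosts of the platform it requires.
    Steps run one after another, so hosts can be reused between steps."""
    plan = []
    for step in steps:
        available = pool.get(step.platform, [])
        if len(available) < step.hosts_needed:
            raise RuntimeError(
                f"not enough {step.platform} hosts for {step.name!r}")
        plan.append((step.name, available[:step.hosts_needed]))
    return plan

# Hypothetical steps and machine pool, loosely following the sample run.
steps = [
    Step("reset database", "aix"),
    Step("install Siebel environment", "aix", hosts_needed=2),
    Step("sanity check", "windows"),
    Step("load test", "windows", hosts_needed=2),
]
pool = {"aix": ["aix-01", "aix-02"], "windows": ["win-01", "win-02"]}

for name, hosts in plan_run(steps, pool):
    print(name, "->", hosts)
```

Keeping the plan in one central place is what lets a single server decide what must run where, regardless of the operating system of the machine that does the work.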

Whirlpool – Whirlpool had a recall that flooded their web systems with requests; consequently, they could not service their customers. The IBM DB2 team was looking at SQL parsing issues. The problem was actually a resource issue in the database, which we identified in less than 1 hour. Further investigation determined that the Siebel connection / login was causing a resource lock on the DB2 user table. The table / file was pinned to memory, which eliminated the database bottleneck.

Once the crisis was over, a closer look could be taken at Whirlpool’s systems. It was determined that there was no throttle on web access. As additional users queried Whirlpool’s systems for recall and registration information, the burden on the system grew linearly. The flood of web activity during recalls limited the system resources available to call center agents, making it nearly impossible for them to service customers. The web access used web services in a stateless mode, which means that each request performs a login to Siebel. Thus, when the next recall happens, a flood of login requests would recreate the condition that caused the crisis. To prevent this, connection pooling to the Siebel server environment is being implemented.
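The difference between a login per request and a pooled connection can be shown with a small sketch. It is a generic illustration, not Siebel's connection pooling feature or its configuration: `SessionPool` and `fake_login` are hypothetical names, and the login is simulated. A fixed number of sessions is logged in once and then reused, and a request that arrives while all sessions are busy waits for one to be returned, which also acts as a throttle.

```python
import queue

class SessionPool:
    """Reuse a fixed number of logged-in sessions instead of logging in
    once per request."""
    def __init__(self, login, size):
        self._idle = queue.Queue()
        for _ in range(size):
            self._idle.put(login())

    def handle(self, request_fn, timeout=5.0):
        # Blocks while every session is busy; raises queue.Empty on timeout.
        session = self._idle.get(timeout=timeout)
        try:
            return request_fn(session)
        finally:
            self._idle.put(session)   # hand the session back for reuse

# Simulated login that records how many times it was called.
login_calls = []
def fake_login():
    login_calls.append(1)
    return {"session_id": len(login_calls)}

pool = SessionPool(fake_login, size=3)
results = [pool.handle(lambda s: s["session_id"]) for _ in range(100)]
print(len(results), "requests served with", len(login_calls), "logins")
```

In the stateless mode described above, the same 100 requests would have meant 100 logins; with the pool, the number of logins is fixed by the pool size no matter how many requests arrive.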

In addition to the connection pooling work, queries that were running for 10 minutes are being addressed.

Juniper – IT Ultra resources received an AWR report from Juniper Networks after Oracle OCS performed a Siebel Review and found nothing wrong with the environment. After looking at the AWR report, IT Ultra determined that there were significant database issues. There were a number of long-running queries; the top 12 ranged from 3 hours down to 30 minutes. A number of queries were running more than 30,000 times an hour, the most frequent of which ran more than 72,000 times an hour. Screen transition times in the call center were over 45 seconds. It was thus determined that the bottleneck was in the database. A number of Siebel configuration changes can be implemented to resolve these issues.
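To put the hourly execution counts in perspective, they can be converted to a per-second rate. The helper below is only arithmetic on the two figures quoted above; the function name is ours.

```python
def per_second(executions_per_hour):
    """Convert an hourly execution count to executions per second."""
    return executions_per_hour / 3600

for rate in (30_000, 72_000):
    print(f"{rate} per hour = {per_second(rate):.1f} per second")
```

A single statement executing 72,000 times an hour is being issued 20 times every second, around the clock, which is why execution counts of this size deserve as much attention as the long-running queries.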

3Com – 3Com utilized Oracle’s BI framework to provide regional reports for sales. The system in place was effectively never returning results because of query times: the queries ran for over 15 hours. After many iterations of query tuning, the worst query ran in under 3 minutes. That query could be improved further with schema changes, which we predict would bring the query time under 5 seconds. Since this is a data warehouse application, it was determined that the 3 minute mark was sufficient.