We have been asked to monitor the performance of a running batch job. The batch job runs on a JEUS application server on a 48-core HP-UX machine and has about 1,500 threads. The most frequently thrown exception is NumberFormatException. The batch job does not terminate, though; it continues to run.
While monitoring with HPjmeter, I noticed that thousands of exceptions are being thrown. NumberFormatException is just one of the more frequent ones; there are many more. I have the following questions:
- Is this indicative of bad design/coding?
- Does the application server usually handle a lot of exceptions and not report them?
- Does this affect the performance of the running applications? (Around 11,000 exceptions were thrown in roughly 45 minutes of running.)
- Yes, especially since the developers have not gone through and at least wrapped these in custom exceptions. Failing that, they should be written to a log file as warnings; there's a reason logging libraries exist.
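As a minimal sketch of that point, the pattern below catches the parse failure at its source, logs it as a warning, and falls back to a default instead of letting the exception propagate (or disappear) silently. The class and method names here are hypothetical, not taken from the job in question:

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class SafeParse {
    private static final Logger LOG = Logger.getLogger(SafeParse.class.getName());

    // Hypothetical helper: parse a numeric field, logging a warning and
    // returning a fallback value rather than swallowing the exception.
    static int parseOrDefault(String raw, int fallback) {
        try {
            return Integer.parseInt(raw.trim());
        } catch (NumberFormatException e) {
            LOG.log(Level.WARNING, "Bad numeric field: \"" + raw + "\"", e);
            return fallback;
        }
    }

    public static void main(String[] args) {
        System.out.println(parseOrDefault("42", -1));  // clean input
        System.out.println(parseOrDefault("4O2", -1)); // letter O typo: logged, defaulted
    }
}
```

With a pattern like this, the log file tells you which fields are failing and how often, instead of the exceptions only being visible from a profiler.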
- If the exceptions are real, the cause is either broken code or a changed dataset. I'd recommend tracing at least one job to understand why the errors occur. Having worked with petabytes of data in a single job, I understand how frustrating that can be, but you'll have hell to pay later if this job's output is consumed downstream and causes problems.
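One cheap way to start that tracing, sketched below under the assumption that you can hook the catch sites: aggregate distinct exception messages with a counter instead of logging every occurrence, then dump the summary after the run. The class name and API are made up for illustration:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ExceptionSampler {
    // Thread-safe counts keyed by exception type + message, so 1,500 worker
    // threads can record failures without flooding the log.
    private static final Map<String, Integer> COUNTS = new ConcurrentHashMap<>();

    static void record(Throwable t) {
        String key = t.getClass().getSimpleName() + ": " + t.getMessage();
        COUNTS.merge(key, 1, Integer::sum);
    }

    static Map<String, Integer> summary() {
        return COUNTS;
    }
}
```

Since NumberFormatException's message includes the offending string ("For input string: ..."), the summary usually points straight at the bad records or the field whose format changed.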
- If the compute path that throws the exception is relatively light, then the cost of constructing each exception (chiefly filling in the stack trace) plus any logging I/O will be large compared to the actual computation. That said, 11,000 exceptions in 45 minutes is only about 4 per second. Certainly that is bad, but assuming no other applications are also performing a lot of I/O, it will not slow your job down too badly.
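If it turns out the failures are expected (dirty input rather than a bug), one way to avoid paying the exception-construction cost on every bad record is to validate before parsing. A sketch, assuming plain decimal integer fields; the helper name is mine, not from the job:

```java
public class FastParse {
    // Returns the parsed value, or null when the string is not a simple
    // decimal integer. Rejecting bad input with a character scan avoids
    // building an exception (and its stack trace) per bad record.
    static Integer tryParseInt(String s) {
        if (s == null || s.isEmpty()) return null;
        int start = (s.charAt(0) == '-' || s.charAt(0) == '+') ? 1 : 0;
        if (start == s.length()) return null;   // just a sign, no digits
        for (int i = start; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < '0' || c > '9') return null;
        }
        try {
            return Integer.parseInt(s); // still guards against int overflow
        } catch (NumberFormatException e) {
            return null;
        }
    }
}
```

Whether this is worth doing depends on the failure rate: at 4 exceptions per second it's noise, but if one code path throws on a large fraction of records in a tight loop, the difference is measurable.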