SDK Design Goal #7: Design for Troubleshooting

This is the eighth article in the SDK Design Goal series. Please see the introduction article “How to present the licensed technology the right way?”.

No matter how good your SDK is, and how easy you made it to integrate, some licensees will still encounter issues during integration. Those issues, ranked by how often they occur, fall into one of the following categories:

  • Invalid SDK usage (for example incorrectly installed or configured SDK, invalid API usage, incorrect license used);
  • Environment issues (for example, lack of permissions to open the requested file);
  • Insufficient system resources (lack of memory or disk space);
  • Bugs in the SDK triggered by otherwise valid usage;
  • Baseline issues (such as hardware issues, corrupted or infected operating system, etc).

There is little you can do about the last two categories. The SDK should handle the rest as gracefully as possible. This means the licensee should learn not only that the SDK cannot perform the requested operation, but also why it cannot.

Why is this important? Because the licensee can fix the top three kinds of issues by themselves, without involving your support. Not only does this save your time and resources, it also speeds up the integration considerably. If you don’t see why, please consider the following example:

Your SDK by design requires all passed file names to be absolute paths. This is documented in multiple places in your documentation, so you expect the licensee to know that and never pass a relative path. However, for some unknown reason, one of your licensees passed a file to the SDK by its relative path during evaluation. Maybe they forgot, maybe they didn’t read the documentation carefully enough, or maybe they had code using their current SDK and just switched the function calls. Either way, the requirement is violated and the SDK fails.

But how exactly does it fail? A poorly designed SDK would return something like INTERNAL_ERROR, an error code that is meaningless to the licensee. Such a code is typically documented in a similarly meaningless way, such as “something happened, contact support”. So the licensee decided to contact your support. Because your support procedure is well documented, it took them only 15 minutes to find the contact details and write a proper bug report. Luckily, your support team had no other issues pending and was able to review the email right away. They requested a test case (this took another 10-15 minutes). The licensee prepared the test case (another 30 minutes to strip out all proprietary code and make sure the case still returned an error). Support analyzed it, debugged it, and found the cause. The licensee was notified that an absolute path was required. They changed the code, and everything worked.

Now, the whole interaction took at least an hour and involved at least three back-and-forth emails. Moreover, this interaction would likely be remembered, and the integrator would characterize your SDK as “we had some problems during integration which eventually got resolved”.

But imagine that the SDK had instead returned ERROR_ABSOLUTE_PATH_REQUIRED. Most likely the licensee wouldn’t even need to consult the documentation to understand what happened and what needs to be done to fix it. Fixing the issue would take a minute, saving time and effort for everyone involved. Being so minuscule, the issue would probably not even be remembered, and the integration experience with your SDK would be described as “went smoothly” – and this would be a very valuable reference!
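To make the contrast concrete, here is a minimal sketch in Python. All names in it (`SdkError`, `scan_file`, the file path) are hypothetical and not taken from any real SDK. It assumes the SDK checks its documented requirement up front and reports a violation with a dedicated code rather than a generic internal error.

```python
import os
from enum import Enum


class SdkError(Enum):
    """Hypothetical error codes; the names are illustrative only."""
    OK = 0
    ERROR_ABSOLUTE_PATH_REQUIRED = 1
    ERROR_FILE_NOT_FOUND = 2


def scan_file(path):
    """Hypothetical SDK entry point that validates its argument first."""
    # Check the documented requirement explicitly, so that a violation
    # maps to its own error code instead of a generic failure.
    if not os.path.isabs(path):
        return SdkError.ERROR_ABSOLUTE_PATH_REQUIRED
    if not os.path.isfile(path):
        return SdkError.ERROR_FILE_NOT_FOUND
    # ... the real work would happen here ...
    return SdkError.OK


result = scan_file("data/sample.bin")  # relative path: requirement violated
print(result.name)  # prints ERROR_ABSOLUTE_PATH_REQUIRED
```

The name of the returned code already tells the integrator what to change, which is the whole point of the example above.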

The above was actually a real-life example, and even with the time wasted it went relatively well, because the licensee took the effort to find out what happened. However, not all of them will do that. In my experience, licensees experiencing integration problems fall into one of the following categories:

  • Around 10% will write off your SDK as “doesn’t work” and abort the evaluation right away. Some will let you know, but will offer only a generic explanation such as “we decided to use another vendor”; the rest will not get back to you at all.
  • Around 30% will email your technology support right away, saying “SDK returned internal error, we don’t know what it means”. If your SDK crashed, this number would probably be higher.
  • Around 40% will do some debugging, and will email your technology support after they have found no obvious faults (the file name passed to the SDK is a valid string in a valid encoding, and the file exists and seems to be accessible). They still won’t know why the error happens.
  • Only 20% of your customers will go through all the steps above, then go to the SDK documentation, find the entry for this specific function, find out that it requires an absolute file name (which might be difficult if it is a single statement in three pages of text), realize that what they pass is not an absolute file name, and correct it. And even those customers will be unhappy, feeling that if this requirement is so important to you, it certainly deserves a dedicated error code such as PATH_NAME_NOT_ABSOLUTE.

Thus it is extremely important that your SDK is troubleshooting-friendly, and that customers using it can troubleshoot most issues themselves, without involving your support. Not only does this create fewer issues for you, it also makes the integration much easier – and faster! – for the customer.

To help your licensees – and yourself – please:

  • Make sure each function that could possibly fail returns an error code for every erroneous situation, instead of crashing or proceeding with meaningless arguments (such as trying to calculate the square root of a negative number);
  • Make sure the returned error codes are meaningful to the licensee, so that if the issue is caused by their error or environment (such as a missing file or lacking permissions), they can understand what should be done to fix it. For example, ERROR_DRIVER_NOT_LOADED is meaningful, while ERROR_C001 is not;
  • Implement a logging option in your SDK that makes it write debug logs explaining what happened inside the SDK. Make those logs meaningful and readable by the licensee (i.e. don’t encrypt them), as in many cases the licensee can read the logs and fix the issue themselves. Try especially hard to have meaningful error messages in the logs: don’t just say “couldn’t open file”, but include which file was being opened and for what purpose (“couldn’t open the passed file”), the reason it couldn’t be opened (“path not found”), and the system error code if known (“errno 5”). Remember, people who integrate your technology are engineers too, and in many cases they can figure out what’s wrong simply by looking at the logs.
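The recommendations above can be sketched roughly as follows. This is only an illustration under stated assumptions: the logger name `examplesdk`, the functions `enable_debug_log` and `load_input_file`, and the error code names are all hypothetical. It uses Python's standard `logging` module to write a plain-text log, and each failure is reported both as a specific return code and as a log line that says what was attempted, why it failed, and which system error code was seen.

```python
import errno
import logging
from enum import Enum

# Hypothetical logger name; a real SDK would choose its own.
log = logging.getLogger("examplesdk")
log.addHandler(logging.NullHandler())  # stay silent unless logging is enabled


class SdkError(Enum):
    """Hypothetical error codes; the names are illustrative only."""
    OK = 0
    ERROR_FILE_NOT_FOUND = 1
    ERROR_ACCESS_DENIED = 2
    ERROR_IO = 3


def enable_debug_log(log_path):
    """Opt-in switch: append plain-text (unencrypted) debug logs to a file."""
    handler = logging.FileHandler(log_path, encoding="utf-8")
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    log.addHandler(handler)
    log.setLevel(logging.DEBUG)


def load_input_file(path):
    """Read the caller's file; return (error_code, data_or_None)."""
    try:
        with open(path, "rb") as f:
            return SdkError.OK, f.read()
    except OSError as exc:
        # Say what was being attempted, why it failed, and the system code.
        log.error("couldn't open the passed file %r: %s (errno %s)",
                  path, exc.strerror, exc.errno)
        if exc.errno == errno.ENOENT:
            return SdkError.ERROR_FILE_NOT_FOUND, None
        if exc.errno == errno.EACCES:
            return SdkError.ERROR_ACCESS_DENIED, None
        return SdkError.ERROR_IO, None
```

A licensee who calls `enable_debug_log("sdk_debug.log")` and then gets ERROR_FILE_NOT_FOUND back from `load_input_file` would find a matching line in the log with the offending path and the operating system's own explanation, which is often enough to fix the problem without contacting support.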

Finally, it is possible that the issue was caused by a bug in the SDK. This happens too; please do not panic. Bugs are an unfortunate reality of modern software, and none of your competitors has an SDK that is completely bug-free. What matters is how fast you can find and fix them.
