Here at Topaz Labs, we train modern AI models to “auto-magically” enhance images and video, and we deliver them to our customers in desktop applications that run on their own hardware. This provides a lot of value to our customers, but it poses new challenges for our engineers in ensuring reliable software delivery.
We recently released new versions of our Sharpen AI and DeNoise AI applications using a next-generation version of our AI Engine. The AI Engine is the part of our software that takes in an AI model and processes an image through it using a variety of hardware-optimized libraries. This new version of the AI Engine has delivered some incredible improvements in speed, but unfortunately the launch of the 3.0 versions of Sharpen AI and DeNoise AI was rougher than expected.
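To give a rough sense of what that involves, here is a simplified, hypothetical sketch of how an engine like this might pick among hardware-optimized backends at runtime. The types and names below are illustrative assumptions, not our actual AI Engine internals:

```python
# Hypothetical sketch of backend selection with fallback; the Backend
# type and its fields are illustrative, not Topaz's real architecture.
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Backend:
    name: str                               # e.g. a GPU-accelerated or CPU-only library
    is_supported: Callable[[], bool]        # probes OS / driver / hardware support
    run: Callable[[bytes, bytes], bytes]    # (model, image) -> processed image

def process_image(model: bytes, image: bytes,
                  backends: List[Backend]) -> Optional[bytes]:
    """Try the fastest supported backend first; fall back on failure."""
    for backend in backends:                # assumed ordered fastest-first
        if not backend.is_supported():
            continue
        try:
            return backend.run(model, image)
        except RuntimeError:
            # An OS/driver/hardware combination we didn't anticipate:
            # fall through to the next, slower-but-safer backend.
            continue
    return None  # no backend worked; surface an error to the user
```

The hard part in practice is the `is_supported` probe: the space of operating system, driver, and hardware combinations is enormous, which is exactly where this launch ran into trouble.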
We ran into problems where, after the initial release, certain combinations of operating systems, CPUs, GPUs, and drivers resulted in crashes or sub-optimal performance. Our customers posted several issues on our forums, and our support ticket volumes skyrocketed.

Fortunately, our team rallied and isolated a few of the major reproducible issues, and our engineers quickly diagnosed and patched numerous problems in short order. However, for many of our customers, the damage was already done: they had gone through an upgrade of our software and gotten a new version that was worse than what they had before.

The issues in our released products affected our customers across three dimensions:
Each of those dimensions has a multiplying effect on the negative impact of a given released bug, and unfortunately these releases suffered larger-than-hoped-for effects across all three.
Now that the dust is settling on those recent releases, the team met and came up with some ways to improve the quality of our product releases while still enabling us to innovate quickly on new enhancements to our products.

Better Testing

We had recently engaged a new Quality Assurance partner, but unfortunately we didn’t ask them to test across a diverse enough set of operating systems, CPUs, and GPUs. We also made large changes to our software, and to the way it is installed for our users, after the testing runs had already completed.

To address these issues, we’ve updated our testing plans to cover five times the number of machine configurations, and we’ve instituted a stricter feature-freeze period between testing and release. This will ensure that we catch more issues earlier, and that we minimize the chances of new issues being introduced after testing.
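To give a sense of why broader coverage matters, here is a toy calculation (not our actual QA matrix) of how quickly distinct machine configurations multiply from just a few choices of operating system, CPU, and GPU:

```python
# Toy illustration of hardware test-matrix growth; the specific
# OS/CPU/GPU lists are examples, not our real QA plan.
from itertools import product

oses = ["Windows 10", "Windows 11", "macOS 12", "macOS 13"]
cpus = ["Intel", "AMD", "Apple Silicon"]
gpus = ["NVIDIA", "AMD", "Intel integrated", "Apple integrated"]

configs = [
    (os_name, cpu, gpu)
    for os_name, cpu, gpu in product(oses, cpus, gpus)
    # Drop combinations that don't exist in the wild, e.g. Apple
    # Silicon CPUs only ship in macOS machines.
    if not (cpu == "Apple Silicon" and not os_name.startswith("macOS"))
]

print(len(configs))  # 40 distinct configurations from only 11 choices
```

Even after filtering out combinations that can’t occur, a handful of options per axis yields dozens of distinct machines, and that is before counting driver versions.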
We’ve also adjusted our release schedule so that products with similar new functionality go out at different times. This means that if issues are encountered with a new shared piece of functionality, like this new AI Engine, they only affect one product at a time and can be fixed for that one product.
One of the new features we rolled out with the new product installers is the ability to optimize software updates: when you update a piece of Topaz software, the installer downloads only the parts of the application that have changed since the last version (a simplified sketch of this idea appears below).

We’ve also changed our release processes to enable more “hotfixes”: fixes that our engineers believe are highly impactful for our customers and carry a low risk of introducing new problems. Close followers of past Topaz releases may have seen us do these before; we’re just further enabling the team to get these fixes out to our customers faster.
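One common way to implement this kind of delta update (this is a general sketch, not a description of our installer’s exact mechanism) is for the server to publish a manifest mapping each file in the application to a content hash; the client then compares the manifest against its local files and downloads only what differs. In Python, with hypothetical names:

```python
# Minimal sketch of manifest-based delta updating; files_to_download
# and the manifest format are assumptions for illustration.
import hashlib
from pathlib import Path
from typing import Dict, List

def sha256_of(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def files_to_download(install_dir: Path, manifest: Dict[str, str]) -> List[str]:
    """Return only the files whose local contents differ from the manifest."""
    changed = []
    for rel_path, expected_hash in manifest.items():
        local = install_dir / rel_path
        if not local.exists() or sha256_of(local) != expected_hash:
            changed.append(rel_path)  # missing or stale: fetch it
    return changed
```

With an approach like this, if only one library changed between two versions, the update downloads just that file instead of the entire application.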
At Topaz Labs, we feel that some of the best lessons are learned when things do not go well, so we endeavor to foster open and honest internal communication in order to learn those lessons as a team. We also aim to keep our promise of high-quality software delivery to our customers, and to be honest when we feel we have fallen short.

Hopefully others will find some of the lessons we’ve learned useful for their own projects. We’ve certainly been encouraged that we can take steps toward even better releases in the future.