Testing vs measurement

A recent thread, "Ask HN: Why the Linux Kernel doesn't have unit tests?", has some good wisdom in the comments.

While tests are great to have, a good and underrated integration testing system is just for someone to run your software. If no one complains, either no one is using it, or the software is doing its work. Do you really need tests for the read(2) syscall when Linux is running on a billion devices, and that syscall is called some 10^12 times per second globally?

In applied machine learning settings, it can be convenient to break things into two related questions:

  • Does it "work" end to end? Absence of failure, ball-park response correctness, etc. can usually be somewhat automated.
  • Does it improve the objective? Are these results better on average than the baseline? This requires some evaluation and metric.

If you're only doing one of these, or using them at cross purposes (i.e. using model performance measurement for testing, or integration tests for quality), you're probably going to have a hard time.
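
To make the distinction concrete, here is a minimal sketch of the two checks as separate pytest-style tests. The predict() function, the example data, and the baseline threshold are all hypothetical stand-ins, not any particular project's API:

    # Two separate checks for a hypothetical text classifier.

    def predict(texts):
        # Stand-in for the real model; assume it returns one label per input.
        return ["positive" for _ in texts]

    def accuracy(preds, labels):
        return sum(p == l for p, l in zip(preds, labels)) / len(labels)

    def test_works_end_to_end():
        # "Does it work?": absence of failure, ball-park response correctness.
        preds = predict(["great movie", "terrible plot"])
        assert len(preds) == 2
        assert all(p in {"positive", "negative"} for p in preds)

    def test_improves_on_baseline():
        # "Does it improve the objective?": needs an eval set and a metric.
        eval_texts = ["great movie", "terrible plot"]
        eval_labels = ["positive", "negative"]
        baseline_accuracy = 0.50  # e.g. a majority-class baseline
        assert accuracy(predict(eval_texts), eval_labels) >= baseline_accuracy

The first check is a natural candidate for automation in CI; the second is usually better tracked as a reported metric against a baseline than squeezed into a hard pass/fail assertion.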

"Systems people" can prefer the predictability of test: code is written, tests pass, ergo it must be better. "ML people" are guilty of the same mindset: number moves up, therefore there are no bugs.1

As is usually the case, a balance of the two, applied in moderation, is likely the best strategy.

Back to unit tests: they certainly have their place, but this comment captures the challenge:

The hard part in adding unit tests is deciding what a unit is, and when a unit is important enough that it should have its own battery of tests. Choosing the wrong boundaries means a lot of wasted time and effort testing things that likely won't break, or that change so fast that the tests put a drag on refactoring.

While it might be satisfying, superficially reassuring, and OKR-friendly to see code coverage north of 80%, it doesn't guarantee that the effort has been well spent. Perhaps the best long-term measure of the value of a unit test is the number and complexity of future issues it prevents. The extreme case: if a test never fails, was it needed in the first place? It depends.

One thing is certain: if some logic takes a long time to get right, and might need adjustment or improvement in the future, that is where tests buy you agility, and your future self will thank you.

Having a good set of unit tests that covers all observable behaviors of a system means you can optimize the logic inside and constantly rerun those tests to make sure nothing is broken. This speeds up the experimentation process and clearly delineates the interface, so you can see when and how you can "cheat" on things that aren't observable.
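
As a sketch of what "testing all observable behaviors" can look like, here is a made-up top_k() helper: the test only asserts on inputs and outputs, so the internals are free to be rewritten for speed and the same test rerun unchanged:

    def top_k(scores, k):
        # Naive implementation: sort everything, keep the indices of the k best.
        return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

    def test_top_k_observable_behavior():
        scores = [0.1, 0.9, 0.4, 0.7]
        assert top_k(scores, 2) == [1, 3]               # best indices, best first
        assert top_k(scores, 0) == []                   # degenerate case
        assert len(top_k(scores, 10)) == len(scores)    # k larger than the input

Swapping the naive sort for a heap-based implementation later changes nothing observable, so the same test keeps guarding the behavior.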

Overall, a nice summary:

"Write tests, not too many. mostly integration". (Guillermo Rauch tweet)


  1. These are stereotypes. In practice, applied ML engineers can wear multiple hats and consider both cases. But often we have a favorite hat.