Collin’s Razor: Look at the biggest and smallest results
During a late-night conversation at Fluidity Forum, some of us went around and shared our personal razors, the little tricks we keep at hand that can often reduce a big problem into something tractable. Here’s mine:
Collin’s razor: look at the biggest and smallest results of your formalism before using it; if the real-world implications of those results don’t make sense, you need a new formalism.
In my day job, the formalisms I’m making are usually reports or charts - a small bit of business logic trying to capture something about the real world, like “how many COVID-positive patients are on each ward?” or “how often do we get recurrent callers to our call center?” It’s much easier to write code that compiles than it is to write code that’s meaningful, so I end up with a little bit of SQL that may or may not be true. And how do I check it? Well, I nearly always ignore any sort of complicated statistics and go straight for old reliable: find the most relevant column to sort on, and look at the biggest and smallest values. My final product will nearly always be an aggregate measure, but I won’t compute that aggregate before looking at the raw data.
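As a concrete sketch of what I mean - the table and column names here are made up for illustration, not from any real report - the check is nothing fancier than sorting by the column you’re about to aggregate and eyeballing both ends before you write the GROUP BY:

```sql
-- Hypothetical example: before computing "average length of stay per ward",
-- look at the raw rows sorted by the value we're about to aggregate.

-- Longest stays first: data entry errors, stays that never got discharged, etc.
SELECT patient_id, ward, admitted_at, discharged_at,
       discharged_at - admitted_at AS length_of_stay
FROM stays
ORDER BY length_of_stay DESC
LIMIT 20;

-- Shortest stays first: zero-length rows, duplicates, cancelled admissions.
SELECT patient_id, ward, admitted_at, discharged_at,
       discharged_at - admitted_at AS length_of_stay
FROM stays
ORDER BY length_of_stay ASC
LIMIT 20;
```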
What if the biggest value is way larger than anything else? Should it be? Sometimes this will find a data entry error, and you need to correct it for your results to matter. Sometimes this finds a definitional error: “Oh, there are more patients on this ward than it has beds - I see now that I neglected to filter out past stays that aren’t currently active.” Sometimes this finds a definitional oddity that isn’t exactly an error, but informs what you ought to be asking instead: “Hmm, do I want to count a case where you’re throttled as a repeat call?” And most interestingly, sometimes it finds something very far out of distribution where you genuinely don’t know the story: “Wait, why did this one person lose so much weight on that diet?” And of course, this all works the other way around too. Especially when you’re investigating averages, “how many zeroes are getting into the final result?” can often matter more than anything else.
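In the same spirit (again with made-up names, and using PostgreSQL-flavored FILTER clauses), the “how many zeroes” check is just a count sitting next to the aggregate you were about to trust:

```sql
-- Hypothetical example: how many zero rows are dragging the average around?
SELECT COUNT(*)                             AS total_rows,
       COUNT(*) FILTER (WHERE calls = 0)    AS zero_rows,
       AVG(calls)                           AS avg_calls,
       AVG(calls) FILTER (WHERE calls > 0)  AS avg_calls_nonzero
FROM caller_stats;
```

If avg_calls and avg_calls_nonzero are wildly different, the interesting question isn’t the average at all - it’s what those zero rows mean and whether they belong in the measure.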
If your formalization is going to be valid, it needs to be valid the whole way through, and that means you need to hold any outliers in your hand and decide what you’re doing with them as a class, not just arbitrarily carve them out. Techniques like winsorizing are acts of statistical cowardice and capitulation, unable to resolve the data to real-world implications because you know in your heart there aren’t any, just trying to get something mediocre enough that you can produce a result that won’t surprise anyone. To understand something in the aggregate, you have to understand what it means for an individual instance. If you don’t have time to see them all, make sure you see the most extreme ones.