In this post:
- Anecdotes about software metrics gone wrong.
- Software quality studies should embrace subjective measures, and for all I know, they already have.
Okay, this is going to be an especially quick and breezy post even by this blog's standards, because I had a really exhausting week, and I need to ease off on myself to avoid schedule slip.
In my tinkering with Structured Data, I hooked the repository up to some code-quality measuring services and started optimizing. (Granted, some of those scores went up only because I looked at the issues they raised, judged them "it's fine", and suppressed them, with justifications varying in how well they'd persuade another human: from "those variables are plenty defined, what's wrong with you" at one end to "anyone who owns themselves with this line of code deserves it" at the other.)
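Concretely, that kind of triage tends to look like pylint's inline suppression comments, which silence one check on one line rather than project-wide. Here's an invented sketch (the class is hypothetical, but `too-few-public-methods` is a real pylint check):

```python
class RetryPolicy:  # pylint: disable=too-few-public-methods
    """A deliberate one-method class: the check fired, a human said 'it's fine'."""

    def __init__(self, attempts=3):
        self.attempts = attempts

    def should_retry(self, attempt):
        # True while we still have attempts left.
        return attempt < self.attempts
```

The disable comment is the "it's fine" verdict made machine-readable, scoped to exactly the code a human actually reviewed.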
Anyway, rewriting my code to satisfy these metrics reminded me of a quote I can't track down, to the effect that explicitly optimizing for a particular metric risks decoupling the metric from whatever it's supposed to be a proxy for. (It's at least adjacent to Goodhart's law.)
I've heard stories of companies trying to measure progress in lines of code, which is a famously bad idea. But other measures can go awry too. Back in college, I ran pylint on some of my code, watched it flag some genuinely bad code, and promptly rewrote that code in a dynamic enough fashion to defeat static analysis. That's... not how you're supposed to get rid of code smells.
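I don't have the original college code, but the move looked something like this invented sketch: replace an attribute access the checker can see with a `getattr` call on a computed name, which it can't reason about.

```python
class Config:
    """Toy config object whose attributes are set dynamically."""

    def __init__(self, **settings):
        self.__dict__.update(settings)


cfg = Config(timeout=30)

# A static checker can at least see this access and reason about it
# (pylint, for instance, may warn about members it can't find):
direct = cfg.timeout

# Routing the access through getattr() with a computed name makes the
# code opaque to static analysis -- a typo in the string now fails only
# at runtime instead of being caught by the linter.
key = "time" + "out"
dynamic = getattr(cfg, key)

assert direct == dynamic == 30
```

The warning goes away, but only because the tool can no longer see the problem, which is the opposite of fixing it.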
Cyclomatic complexity is too high? Rewrite the function as a bunch of smaller functions!
Statement count is too high? Use method chaining to cram more computation into a single logical line!
Pylint complains about too few public methods? Do anything at all besides disable the check globally.
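To make the statement-count gaming concrete, here's an invented before-and-after: the same cleanup logic written as several statements, then crammed into one logical line.

```python
def clean_names_plain(raw):
    """Several statements: each step is visible and separately debuggable."""
    names = []
    for item in raw:
        name = item.strip()
        name = name.lower()
        name = name.replace(" ", "_")
        names.append(name)
    return names


def clean_names_chained(raw):
    """Statement count: one. The metric improved; the computation didn't change."""
    return [item.strip().lower().replace(" ", "_") for item in raw]


assert clean_names_plain(["  Foo Bar "]) == clean_names_chained(["  Foo Bar "]) == ["foo_bar"]
```

Whether the one-liner is actually worse is debatable in this case, and that's part of the point: the metric can't tell the difference between a tasteful chain and an unreadable one.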
What I'm getting at is: it's certainly possible to reason about connections between topological measures and system brittleness, but the real measure of a metric's worth is what the code looks like after you optimize for it. I'm actually not sure I'm against how my code ended up once I started forcing the complexity down, but it definitely went in some surprising directions.
I could imagine a study in which developers with varying levels of experience are instructed to optimize a small but useful codebase according to one metric or another. The results could then be rated by some subjective measures, producing a distribution for each metric. From that, you could ask: which metrics impose a ceiling on quality, and what kind of floor does each one provide?
It might sound weird to focus on subjective measures, but code you don't like looking at is harder to maintain, so subjective appraisals of the code are actually an extremely important measure. The code could also be benchmarked, but my intuition is that most metrics correlate only slightly with the characteristics that affect performance. And I'm not sure that something like defect rates is a useful measure either: in any language with a compilation step (even Python, which compiles to bytecode), you'd much rather inspect the source than the compiled output, yet in some sense the two have identical defects reported against them. Since defects are all but inevitable, it's much more important that the code empower the maintenance programmer.
I don't have the resources to pull something like this off. Someone please run a study using these ideas, thanks in advance.