A very short and simple post about the metrics used to validate classification models.

Confusion matrix

A matrix made of:

  • true positives: my test says spam and the email is spam (tp)
  • false positives (type 1 error): my test says spam but the email is not spam (fp)
  • false negatives (type 2 error): my test says not spam but the email is spam (fn)
  • true negatives: my test says not spam and the email is not spam (tn)

For the rest of the article, let's use the following numbers: tp = 10, fp = 100, fn = 1000, tn = 10000
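
As a side note, here is a minimal sketch of how these four counts could be tallied from lists of labels. The confusion_counts helper and the toy lists below are not from the original post, just an illustration (True stands for "spam"):

def confusion_counts(actual, predicted):
    # tally the four confusion matrix cells for a binary problem
    tp = sum(1 for a, p in zip(actual, predicted) if p and a)
    fp = sum(1 for a, p in zip(actual, predicted) if p and not a)
    fn = sum(1 for a, p in zip(actual, predicted) if not p and a)
    tn = sum(1 for a, p in zip(actual, predicted) if not p and not a)
    return tp, fp, fn, tn

confusion_counts([True, False, True, False], [True, True, False, False])
# -> (1, 1, 1, 1): one example of each kind of outcome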


Accuracy

The fraction of correct predictions:

def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

accuracy(10, 100, 1000, 10000)
Out[1]: 0.900990099009901

Very high, but the test is obviously crap, as we will see with the following three metrics. The result is completely dominated by the true negatives. An example of such a crap test with high accuracy could be 'you are CEO if your name is Jack': almost everyone is neither named Jack nor a CEO, so the test is 'right' nearly all the time, even though there are plenty of Jacks who are not CEOs.
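
To make the same point with the numbers used here (this extra call is not in the original post, just an illustration): a 'classifier' that marks every email as not spam produces no positives at all, yet its accuracy barely drops.

accuracy(0, 0, 1010, 10100)
# -> roughly 0.909, while catching zero spam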


Precision

How accurate our positive predictions are:

def precision(tp, fp, fn, tn):
    return tp / (tp + fp)

precision(10, 100, 1000, 10000)
Out[2]: 0.09090909090909091

Only about 9% of the emails flagged as spam actually are spam.


Recall

What fraction of the actual positives the model identified:

def recall(tp, fp, fn, tn):
    return tp / (tp + fn)

recall(10, 100, 1000, 10000)
Out[3]: 0.009900990099009901

Again, not a good score: the model catches less than 1% of the actual spam.

F1 score

The harmonic mean of precision and recall. The harmonic mean is a good choice because it is pulled towards the lower of the two scores, so a model cannot hide poor precision behind good recall (or vice versa).

def f1_score(tp, fp, fn, tn):
    return 2 / (1/precision(tp, fp, fn, tn) + 1/recall(tp, fp, fn, tn))

f1_score(10, 100, 1000, 10000)
Out[5]: 0.017857142857142856
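
As a quick sanity check (not in the original post), the same value comes out of the more familiar closed form 2 * precision * recall / (precision + recall):

p = precision(10, 100, 1000, 10000)
r = recall(10, 100, 1000, 10000)
2 * p * r / (p + r)
# -> 0.01785714..., the same value as f1_score above (up to floating point)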