When dealing with multiclass or multilabel problems, the data is treated as a set of binary classification problems (either one-vs-one or one-vs-rest). There are several ways to average the binary metric values across the set of classes; the two discussed here are the micro and macro methods.

"micro" gives each sample-class pair an equal contribution to the overall metric (except as a result of sample-weight). Rather than summing the metric per class, this sums the dividends and divisors that make up the per-class metrics to calculate an overall quotient. Micro-averaging may be preferred in multilabel settings, including multiclass classification where a majority class is to be ignored.

"macro" simply calculates the mean of the binary metrics, giving equal weight to each class. In problems where infrequent classes are nonetheless important, macro-averaging may be a means of highlighting their performance. On the other hand, the assumption that all classes are equally important is often untrue, such that macro-averaging will over-emphasize the typically low performance on an infrequent class.
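The two strategies above can be sketched for the single-label multiclass case as follows (a minimal sketch; the function names are my own, not from any library):

```python
from collections import Counter

def precision_counts(y_true, y_pred):
    """One-vs-rest TP/FP counts per predicted class (single-label case)."""
    tp, fp = Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[p] += 1
        else:
            fp[p] += 1  # p was predicted, but the true label was something else
    return tp, fp

def macro_precision(y_true, y_pred, classes):
    # mean of the per-class precisions: every class gets equal weight
    tp, fp = precision_counts(y_true, y_pred)
    per_class = [tp[c] / (tp[c] + fp[c]) if (tp[c] + fp[c]) else 0.0
                 for c in classes]
    return sum(per_class) / len(classes)

def micro_precision(y_true, y_pred, classes):
    # pool the numerators and denominators across classes, then divide once
    tp, fp = precision_counts(y_true, y_pred)
    return sum(tp[c] for c in classes) / sum(tp[c] + fp[c] for c in classes)
```

For example, with `y_true = [1, 1, 1, 1, 2]` and `y_pred = [1, 1, 1, 2, 1]`, class 1 has precision 3/4 and class 2 has precision 0, so the macro score is 0.375 while the micro score is 3/5 = 0.6.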

The abbreviations used below are:
TP: True Positive, FP: False Positive, TN: True Negative, FN: False Negative

The definitions of precision and recall are:

precision = TP / (TP + FP)
recall = TP / (TP + FN)

Assume we have two classes; we use subscripts 1 and 2 to denote them.

Then the macro precision is the plain average of the two per-class precisions:

macro precision = (precision_1 + precision_2) / 2

Then the micro precision pools the counts across classes before dividing:

micro precision = (TP_1 + TP_2) / ((TP_1 + FP_1) + (TP_2 + FP_2))
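Plugging hypothetical counts into the two formulas makes the difference concrete (the numbers below are made up for illustration, with class 1 frequent and class 2 rare):

```python
# Hypothetical two-class confusion counts (made-up numbers for illustration)
TP1, FP1 = 90, 10   # class 1: precision_1 = 90 / 100 = 0.9
TP2, FP2 = 3, 7     # class 2: precision_2 = 3 / 10  = 0.3

precision_1 = TP1 / (TP1 + FP1)
precision_2 = TP2 / (TP2 + FP2)

macro = (precision_1 + precision_2) / 2           # (0.9 + 0.3) / 2 = 0.6
micro = (TP1 + TP2) / (TP1 + FP1 + TP2 + FP2)     # 93 / 110 ≈ 0.845
```

The frequent class dominates the micro score, while the macro score is pulled down by the rare class's low precision.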

If the data set is imbalanced, the micro score is often recommended, because the macro score gives equal weight to each class and does not take the class sample sizes into consideration.
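scikit-learn exposes both strategies through the `average` argument of `precision_score`; a quick check on an imbalanced toy set (the labels below are chosen arbitrarily):

```python
from sklearn.metrics import precision_score

# Imbalanced single-label data: class 0 dominates, class 1 is rare
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

# class 0: precision 7/8 = 0.875; class 1: precision 1/2 = 0.5
print(precision_score(y_true, y_pred, average='macro'))  # 0.6875
print(precision_score(y_true, y_pred, average='micro'))  # 0.8
```

The micro score sits close to the majority class's precision, while the macro score is dragged toward the minority class's weaker result.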


