```python
import numpy as np

def entropy(p):
    """Measure of randomness in Information Theory.

    Args:
        p (iterable): The probabilities of the possible outcomes.

    Returns:
        float: The information entropy (in nats, since np.log is the natural logarithm).
    """
    # Sum of -p_i * log(p_i) over all outcomes.
    return -np.sum([p_i * np.log(p_i) for p_i in p])
```
Information Theory is a cornerstone of Machine Learning, yet most Data Science practitioners have never been formally trained in it. This blog post gives a brief introduction to entropy, a fundamental measure of uncertainty that is used in many popular Machine Learning algorithms.
Information theory studies the quantification, storage, and communication of information. ― cit. Wikipedia
Entropy is a measure of randomness or, in other words, of unpredictability. The higher the entropy, the more unpredictable the outcome. Entropy also represents the minimum expected code length needed to encode samples of the random variable: intuitively, the lower the entropy, the easier the random variable is to encode.
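As a quick, informal illustration (a minimal sketch reusing the entropy function defined at the top of this post, with two arbitrary example distributions), a uniform distribution is maximally unpredictable, while a heavily skewed one is much easier to predict and to encode:

```python
uniform = [0.25, 0.25, 0.25, 0.25]  # every outcome equally likely
skewed = [0.97, 0.01, 0.01, 0.01]   # one outcome almost certain

print(entropy(uniform))  # ~1.386 = log(4), the maximum for four outcomes
print(entropy(skewed))   # ~0.168, far more predictable (and cheaper to encode)
```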
Some popular Machine Learning algorithms ― e.g. Decision Tree, Random Forest, Gradient Boosting, XGBoost ― can use entropy to decide whether and how to split a leaf into two leaves, as sketched below.
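To make that concrete, here is a minimal sketch of how such a split could be scored using information gain, i.e. the reduction in entropy obtained by partitioning the labels in a node. This is only an illustration, not how any particular library implements it; the helpers label_entropy and information_gain are hypothetical names introduced here.

```python
from collections import Counter

def label_entropy(labels):
    """Entropy of the empirical label distribution in a node."""
    counts = Counter(labels)
    probs = [count / len(labels) for count in counts.values()]
    return entropy(probs)  # the entropy function defined at the top of this post

def information_gain(parent, left, right):
    """Entropy of the parent node minus the size-weighted entropy of its children."""
    weight_left, weight_right = len(left) / len(parent), len(right) / len(parent)
    children = weight_left * label_entropy(left) + weight_right * label_entropy(right)
    return label_entropy(parent) - children

# A candidate split that separates the two colors well yields a high gain.
parent = ["red"] * 7 + ["blue"] * 3
left, right = ["red"] * 6, ["red"] * 1 + ["blue"] * 3
print(information_gain(parent, left, right))  # ~0.386: entropy drops after the split
```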
Let’s look at how entropy works with a very simple example. Say we are playing a game that involves drawing marbles from an urn. Marbles can be of only two colors, red or blue, and the proportion of red to blue marbles is 70%-30%. How can we measure the uncertainty of this game? Entropy can come to our rescue.
Entropy, usually represented with the letter $ H $, is mathematically defined as:
\[ \begin{align} H(X) &= \sum_{i=1}^N P(x_i) \log \frac{1}{P(x_i)} \\ &= - \sum_{i=1}^N P(x_i) \log P(x_i) \end{align} \]
In our example, this would be:
\[ \begin{align} H(X) &= - \sum_{i=1}^{2} P(x_i) \log P(x_i) \\ &= - P(x = \text{blue}) \log P(x = \text{blue}) - P(x = \text{red}) \log P(x = \text{red}) \\ &= - 0.30 \log 0.30 - 0.70 \log 0.70 \\ &\approx 0.610864 \end{align} \]
If you find it easier to read code than maths, the simple Python function at the top of this post implements exactly this formula.
We can then open a Python interpreter and run the following commands to verify the result.
```python
>>> p = [.30, .70]
>>> entropy(p)
0.6108643020548935
```
We haven’t yet mentioned the unit of measure of entropy. When the logarithm is taken in base 2, entropy is measured in “bits” (alternatively called “shannons”); with the natural logarithm, as in our formula and code above, the unit is the “nat”. In either case, the unit of measure of entropy shouldn’t be confused with the binary digit.
Confusion often arises because the words bit and binary digit are used interchangeably. But, within information theory, a bit and a binary digit are fundamentally different types of entities. A binary digit is a number that can adopt one of two possible values (0 or 1), whereas a bit is the maximum amount of information that can be conveyed by a binary digit. By analogy, a binary digit is like a container, whereas information is the amount of matter in the container. ― cit. Wikipedia
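To see the difference between the two units in practice, here is a small sketch that recomputes the entropy of our marble game with a base-2 logarithm. It assumes the entropy function from the top of this post is already defined; entropy_bits is a hypothetical variant introduced here for illustration.

```python
import numpy as np

def entropy_bits(p):
    """Same formula as entropy above, but with a base-2 logarithm (result in bits)."""
    return -np.sum([p_i * np.log2(p_i) for p_i in p])

p = [.30, .70]
print(entropy(p))              # ~0.611 nats (natural logarithm)
print(entropy_bits(p))         # ~0.881 bits (base-2 logarithm)
print(entropy_bits([.5, .5]))  # exactly 1.0: a fair coin flip conveys one bit
```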
It should also be noted that, when $ P(x_i) = 0 $ for some $ i $, the corresponding summand $ 0 \log(0) $ is taken to be $ 0 $, consistent with the limit $ \lim_{p \to 0^+} p \log p = 0 $.
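The naive implementation above does not apply this convention: for a zero probability it emits a RuntimeWarning and returns nan. A slightly more defensive sketch (just one possible way to handle it) simply skips the zero-probability terms:

```python
def entropy_safe(p):
    """Entropy that applies the 0 * log(0) = 0 convention by skipping zero terms."""
    return -np.sum([p_i * np.log(p_i) for p_i in p if p_i > 0])

print(entropy_safe([0.0, 1.0]) == 0)  # True: a certain outcome carries no information
print(entropy_safe([.30, .70]))       # 0.6108643020548935, unchanged when all p_i > 0
```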
In the next blog post I’ll show some applications of entropy in Machine Learning and Bayesian inference.