How do you determine the root of a decision tree?

While I have briefly covered decision trees before, I’d like to go a little bit more in depth about how particular nodes are chosen, in this case the root.

Recall that the earlier you are in your “20 Questions” game, the more general your question should be. For example, suppose your opponent has chosen Austin Powers, and you must identify him within 20 questions. A question like “Is it alive?” will provide much more information than “Is it located in the state of California?” Put differently, it will more effectively narrow down the set of possible answers.

One way of measuring how effective a question is at splitting the data is Gini impurity. In plain English, this measures how mixed a group is: a perfectly homogeneous (uniform) group has an impurity of 0. This is best illustrated with an example.

Let’s suppose that we want to know which characteristics of people predict whether or not they like coffee. We start off by asking two simple questions:

  1. What is your gender?
  2. Are you a morning person?

We survey four of our friends, and this is the result.

[Figure: The result of a small survey.]

We next want to find out which question has more predictive power. You could probably guess from a sample this small, but eyeballing doesn’t scale. Instead, let’s split on each question and tally up the results.

[Figure: Tallying up the results of our survey.]

Now let’s start thinking in probabilities. Looking at the node for men, we can see that all men we surveyed (2 out of 2) liked coffee. More formally, we can write:

P(Likes coffee | Man) = 1

This means that the probability that someone likes coffee, given that they are a man, is 100%. The fact that all men surveyed like coffee also implies the following:

P(Doesn’t like coffee | Man) = 0

Then, to calculate the Gini impurity of this node, we square both probabilities and subtract them from 1. (In general, a node’s Gini impurity is 1 minus the sum of its squared class probabilities.)

1 – P(Likes coffee | Man)² – P(Doesn’t like coffee | Man)²

= 1 – 1² – 0²

= 0

An impurity of 0 means that the node is completely pure, or homogeneous, which matches our intuition. Repeating the same steps for the node for women, where 1 out of 2 likes coffee, gives an impurity of 1 – 0.5² – 0.5² = 0.5.
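If it helps to see the arithmetic as code, here is a minimal sketch in Python. The function name `gini` and the `[likes, dislikes]` count lists are my own framing, not from the original post:

```python
def gini(counts):
    """Gini impurity of a node: 1 minus the sum of the squared class probabilities."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

# Men node: 2 like coffee, 0 don't -> perfectly pure.
print(gini([2, 0]))  # 0.0

# Women node: 1 likes coffee, 1 doesn't -> as mixed as a two-class node gets.
print(gini([1, 1]))  # 0.5
```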

To score the split as a whole, we take a weighted average of the children’s impurities, where each child is weighted by the fraction of surveyed people that fall into it:

(2 / 4) * 0 + (2 / 4) * 0.5 = 0.25
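The weighted average can be sketched the same way (again, the function names are mine):

```python
def gini(counts):
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def split_impurity(children):
    """Weighted average of the children's impurities,
    weighted by each child's share of the samples."""
    total = sum(sum(c) for c in children)
    return sum(sum(c) / total * gini(c) for c in children)

# Gender split: men node [2 like, 0 dislike], women node [1 like, 1 dislike].
print(split_impurity([[2, 0], [1, 1]]))  # 0.25
```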

For your convenience, I have also calculated the impurity of the “morning person” split, which comes out to approximately 0.33.

Now we have a clear metric by which to compare the two questions. The gender split has the lower impurity (0.25 versus 0.33), meaning that it gives us more information. Therefore, our root node should split on gender.
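To make the comparison concrete, here is a small self-contained sketch. The gender counts follow directly from the numbers above; the morning-person counts are a hypothetical assignment (the original survey table is in an image not reproduced here), chosen so its split impurity comes out to roughly 0.33 as stated:

```python
def gini(counts):
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def split_impurity(children):
    total = sum(sum(c) for c in children)
    return sum(sum(c) / total * gini(c) for c in children)

# [likes, dislikes] counts per child node. The morning-person counts below
# are hypothetical, consistent with the ~0.33 impurity mentioned above.
splits = {
    "gender":         [[2, 0], [1, 1]],  # men, women
    "morning person": [[1, 0], [2, 1]],  # morning people, non-morning people
}

for name, children in splits.items():
    print(name, round(split_impurity(children), 3))
# gender 0.25
# morning person 0.333

root = min(splits, key=lambda name: split_impurity(splits[name]))
print("Split the root on:", root)  # gender
```

This is exactly the greedy rule a classification tree uses: evaluate every candidate split, then pick the one with the lowest weighted impurity.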

In a future blog post, I will talk about how to split on continuous values (like height). The technique is similar but requires a bit more work.

I’d like to thank StatQuest for their great video on classification trees. If you’d like to learn more, I highly recommend viewing it.

