When choosing one feature from \(X_1, \ldots, X_n\) while building a Decision Tree, which of the following criteria is the most appropriate to maximize? (Here, \(H()\) means entropy, and \(P()\) means probability)

(a) \(P(Y | X_j)\)

(b) \(P(Y) - P(Y | X_j)\)

(c) \(H(Y) - H(Y | X_j)\)

(d) \(H(Y | X_j)\)

(e) \(H(Y) - P(Y)\)

1 Answer

Best answer
The most appropriate criterion to maximize when choosing a feature in a decision tree is \((c) \ H(Y) - H(Y | X_j)\).

Explanation:

\(H(Y)\) represents the entropy of the target variable \(Y\), measuring its uncertainty or randomness. \(H(Y | X_j)\) represents the conditional entropy of \(Y\) given a specific feature \(X_j\), indicating how much uncertainty remains about \(Y\) after knowing the value of \(X_j\).

Information Gain:

The difference between these two entropies, \(H(Y) - H(Y | X_j)\), is called the information gain associated with feature \(X_j\). It quantifies the reduction in uncertainty about \(Y\) achieved by knowing the value of \(X_j\).

Goal of Decision Trees:

Decision trees aim to create splits that reduce uncertainty about the target variable as much as possible. Therefore, maximizing the information gain, which means maximizing \(H(Y) - H(Y | X_j)\), is the most appropriate criterion for feature selection.
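To make the idea concrete, here is a minimal sketch (not from the original answer) that computes \(H(Y)\), \(H(Y \mid X)\), and the information gain for a toy categorical dataset. The function names and the example data are illustrative assumptions, not part of any particular library:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(Y) of a sequence of labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def conditional_entropy(feature, labels):
    """H(Y | X): entropy of Y within each group of X, weighted by group size."""
    n = len(labels)
    groups = {}
    for x, y in zip(feature, labels):
        groups.setdefault(x, []).append(y)
    return sum(len(ys) / n * entropy(ys) for ys in groups.values())

def information_gain(feature, labels):
    """IG(Y; X) = H(Y) - H(Y | X): reduction in uncertainty about Y from X."""
    return entropy(labels) - conditional_entropy(feature, labels)

# Toy dataset: X splits Y perfectly, so all uncertainty is removed
# and the gain equals H(Y) = 1 bit.
X = ["a", "a", "b", "b"]
Y = [0, 0, 1, 1]
print(information_gain(X, Y))  # 1.0
```

A feature that carries no information about \(Y\) (e.g. `X = ["a", "b", "a", "b"]` with the same `Y`) yields a gain of 0, which is why a decision tree learner would never select it for a split.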
