Decision Trees

Decision trees make decisions by learning where to cut on the input variables: each node tests one variable against a learned threshold, and each leaf returns a prediction. A single decision tree is rarely used on its own, because it overfits easily and small changes in the training data can change its structure completely; instead, decision trees are used as building blocks for more complex ensemble models. These ensembles can often solve complex problems while being more explainable, easier to train and faster to run than neural networks. Unless you know you have a very complex problem, it is therefore often a good idea to start with a boosted decision tree (BDT) or random forest to create a baseline.
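
To make the idea of learned cuts concrete, here is a minimal sketch using scikit-learn's DecisionTreeClassifier on its bundled iris dataset (chosen purely for illustration; any tabular dataset works the same way). export_text prints the variables and thresholds the tree has learned.

# Minimal single-tree sketch: fit a shallow tree and print the cuts it learned.
# The iris dataset is a stand-in for any tabular dataset.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
X_iris, y_iris = data.data, data.target

# Keep the tree shallow so the learned cuts stay readable
tree = DecisionTreeClassifier(max_depth=2, random_state=42)
tree.fit(X_iris, y_iris)

# Each line shows a variable, the threshold it is cut on, and the resulting branches
print(export_text(tree, feature_names=data.feature_names))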

Core Concepts

Since decision trees are building blocks, we need to understand how they are used in more complex models. The most common methods are Random Forests and Gradient Boosting.

  • Random Forests

    Random Forests are an ensemble learning method that constructs many decision trees during training and outputs the majority vote of the individual trees (classification) or their mean prediction (regression). They reduce overfitting by introducing randomness in two ways:

    • Training each tree on a random subset of the data (bootstrap sampling)
    • Selecting a random subset of features at each split

    This approach creates diverse trees whose errors tend to average out when their predictions are combined, so the ensemble is less likely to overfit than any single tree (the voting behaviour is demonstrated after the Random Forest implementation example below).


  • Gradient Boosting

    Gradient Boosting is a powerful ensemble technique that builds trees sequentially, where each new tree helps correct the errors made by the previously trained trees. The process, illustrated with a short sketch after this list, involves:

    • Training trees one at a time
    • Each new tree focuses on the errors of the previous trees
    • Combining predictions using a weighted sum

    Popular implementations include:

    • XGBoost - Optimized for speed and performance
    • LightGBM - Microsoft's gradient boosting framework
    • CatBoost - Yandex's gradient boosting library
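
To make the sequential error-correction concrete, here is a hand-rolled sketch of gradient boosting for regression with squared-error loss, where each new tree is fit to the residuals of the current ensemble. This is a simplified illustration on a synthetic dataset, not how XGBoost or LightGBM are implemented internally.

# Hand-rolled gradient boosting sketch (regression, squared-error loss).
# Each new tree is fit to the residuals, i.e. the errors of the trees before it;
# the final prediction is a learning-rate-weighted sum of all trees.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X_reg, y_reg = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=42)

learning_rate = 0.1
n_trees = 50
trees = []

# Start from a constant prediction: the mean of the targets
prediction = np.full(len(y_reg), y_reg.mean())

for _ in range(n_trees):
    residuals = y_reg - prediction                   # errors of the current ensemble
    tree = DecisionTreeRegressor(max_depth=3, random_state=42)
    tree.fit(X_reg, residuals)                       # the new tree focuses on those errors
    prediction += learning_rate * tree.predict(X_reg)  # shrunken (weighted) update
    trees.append(tree)

print("Training MSE after boosting:", np.mean((y_reg - prediction) ** 2))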

Here is a table comparing the two methods; a rough timing sketch below the table illustrates the training-speed difference.

Aspect                   Random Forests                 Gradient Boosting
Training Speed           Faster (parallel training)     Slower (sequential training)
Prediction Speed         Slower                         Faster
Overfitting              Less prone to overfitting      More prone to overfitting
Hyperparameter Tuning    Less sensitive                 More sensitive
Noise Handling           Better                         Worse
Feature Importance       More reliable                  Less reliable
Memory Usage             Higher                         Lower
Best Use Cases           General purpose, noisy data    Structured data, competitions
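
As a rough way to see the training-speed row in practice, the sketch below times a parallel Random Forest against scikit-learn's sequential GradientBoostingClassifier on a synthetic dataset. The dataset and model sizes are arbitrary choices for illustration, and absolute numbers depend on your hardware.

# Rough timing sketch: parallel forest training vs. sequential boosting.
# Synthetic data and model sizes are arbitrary; only the relative trend matters.
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

X_demo, y_demo = make_classification(n_samples=10000, n_features=20, random_state=42)

models = {
    "Random Forest (parallel)": RandomForestClassifier(
        n_estimators=100, n_jobs=-1, random_state=42   # trees are built in parallel
    ),
    "Gradient Boosting (sequential)": GradientBoostingClassifier(
        n_estimators=100, random_state=42              # trees are built one after another
    ),
}

for name, model in models.items():
    start = time.perf_counter()
    model.fit(X_demo, y_demo)
    print(f"{name}: {time.perf_counter() - start:.1f} s")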

Detailed Concepts

1. Basic Principles

2. Model Evaluation

3. Using Decision Trees in More Complex Models

4. Practical Considerations

Implementation Examples


# Random Forest example: import the necessary libraries
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np

# Prepare your data (X is assumed to be your feature table, e.g. a pandas DataFrame, and y the target labels)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2, 
    random_state=42
)

# Initialize and train the model
rf_model = RandomForestClassifier(
    n_estimators=100,    # Number of trees
    max_depth=10,        # Maximum depth of trees
    min_samples_split=2, # Minimum samples required to split
    random_state=42
)
rf_model.fit(X_train, y_train)

# Make predictions
predictions = rf_model.predict(X_test)

# Get feature importance
feature_importance = pd.DataFrame({
    'feature': X_train.columns,
    'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)

# Print feature importance
print("Feature Importance:")
print(feature_importance)
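
To connect this back to the ensemble-voting idea from the Core Concepts section, the short sketch below (which assumes the rf_model and X_test defined above) compares the forest's output to a hard majority vote over its individual trees. Note that scikit-learn's forest actually averages the trees' predicted class probabilities rather than counting hard votes, so the two can occasionally disagree.

# Sketch: compare the forest's prediction to a hard majority vote of its trees.
# Individual trees predict encoded class indices, so map the vote back through classes_.
X_test_arr = np.asarray(X_test)   # sub-trees were fitted on plain arrays internally
per_tree = np.array([tree.predict(X_test_arr) for tree in rf_model.estimators_])

# Count votes for each class across the trees, for every test sample
vote_counts = np.stack([(per_tree == idx).sum(axis=0)
                        for idx in range(len(rf_model.classes_))])
majority_vote = rf_model.classes_[vote_counts.argmax(axis=0)]

agreement = np.mean(majority_vote == rf_model.predict(X_test))
print(f"Forest and per-tree majority vote agree on {agreement:.1%} of test samples")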


# XGBoost (gradient boosting) example: import the necessary libraries
import xgboost as xgb
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np

# Prepare your data (again assuming X holds the features and y the target labels)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2, 
    random_state=42
)

# Initialize and train the model
xgb_model = xgb.XGBClassifier(
    n_estimators=100,         # Number of boosting rounds
    max_depth=6,              # Maximum depth of trees
    learning_rate=0.1,        # Step size shrinkage
    subsample=0.8,            # Subsample ratio of training instances
    colsample_bytree=0.8,     # Subsample ratio of columns when constructing each tree
    early_stopping_rounds=10, # Stop if the eval metric does not improve for 10 rounds
    random_state=42
)

# Train with early stopping. In recent XGBoost versions early_stopping_rounds is
# passed to the constructor (as above), not to fit(). Ideally, use a separate
# validation split here rather than the test set.
xgb_model.fit(
    X_train, y_train,
    eval_set=[(X_test, y_test)],
    verbose=False
)

# Make predictions
predictions = xgb_model.predict(X_test)

# Get feature importance
feature_importance = pd.DataFrame({
    'feature': X_train.columns,
    'importance': xgb_model.feature_importances_
}).sort_values('importance', ascending=False)

# Print feature importance
print("Feature Importance:")
print(feature_importance)

# Optional: Plot feature importance
import matplotlib.pyplot as plt
xgb.plot_importance(xgb_model)
plt.show()
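
The two examples stop at raw predictions. A quick evaluation sketch using standard scikit-learn metrics (assuming the rf_model, xgb_model and test split defined above, which share the same random_state) could look like this:

# Evaluation sketch: compare both trained models on the held-out test set
from sklearn.metrics import accuracy_score, classification_report

for name, model in [("Random Forest", rf_model), ("XGBoost", xgb_model)]:
    preds = model.predict(X_test)
    print(f"{name} accuracy: {accuracy_score(y_test, preds):.3f}")
    print(classification_report(y_test, preds))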