Decision trees make decisions by learning where to cut on variables. On their own, decision trees are rarely used; instead, they serve as building blocks for more complex models. To learn why, and to see a beautiful visualization of how decision trees work, please click this link. These more complex models can often solve hard problems while being more explainable, easier to train, and faster to run than neural networks. Unless you know you have a very complex problem, it is therefore often a good idea to start with a boosted decision tree (BDT) or a random forest to create a baseline.
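To make "learning where to cut" concrete, here is a minimal sketch using scikit-learn's `DecisionTreeClassifier` on the Iris dataset (the dataset and the shallow depth are just illustrative choices, not taken from the text above); `export_text` prints the thresholds, i.e. the cuts, that the tree has learned.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Small example dataset (illustrative choice)
iris = load_iris()

# Keep the tree shallow so the learned cuts stay readable
tree = DecisionTreeClassifier(max_depth=2, random_state=42)
tree.fit(iris.data, iris.target)

# Print the learned cut thresholds as a text rule set
print(export_text(tree, feature_names=iris.feature_names))
```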
Since decision trees are building blocks, we need to understand how they are used in more complex models. The most common methods are Random Forests and Gradient Boosting.
Random Forests are an ensemble learning method that constructs many decision trees during training and outputs the class chosen by most trees (classification) or the mean prediction of the individual trees (regression). They help reduce overfitting by introducing randomness in two ways:

- Bootstrap sampling (bagging): each tree is trained on a random sample of the training data, drawn with replacement.
- Random feature selection: at each split, only a random subset of the features is considered.
Together, these two sources of randomness create diverse trees whose combined vote is less likely to overfit than any single tree. You can see a beautiful visualization and explanation of how random forests work here.
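As a rough, simplified sketch of those two sources of randomness (not scikit-learn's actual implementation, which re-draws the feature subset at every split rather than once per tree), growing one tree of the forest could look like this:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)

def grow_random_tree(X, y, max_features):
    """One tree of a toy random forest; X and y are assumed to be NumPy arrays."""
    n_samples, n_features = X.shape
    # 1. Bootstrap sampling: draw training rows with replacement
    rows = rng.integers(0, n_samples, size=n_samples)
    # 2. Feature subsampling (done per tree here to keep the sketch short;
    #    scikit-learn actually re-draws the feature subset at every split)
    cols = rng.choice(n_features, size=max_features, replace=False)
    tree = DecisionTreeClassifier().fit(X[rows][:, cols], y[rows])
    return tree, cols
```

In `RandomForestClassifier` these two ideas correspond to the `bootstrap` and `max_features` parameters.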
Gradient Boosting is a powerful ensemble technique that builds trees sequentially, where each new tree helps correct the errors made by the previously trained trees. The process involves:

1. Starting from a simple initial prediction (for example, the mean of the targets).
2. Computing the residuals, i.e. the negative gradient of the loss with respect to the current predictions.
3. Fitting a new, typically shallow, tree to those residuals.
4. Adding the new tree's predictions, scaled by a learning rate, to the ensemble and repeating.

A toy version of this loop is sketched below.
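This toy version uses squared-error regression, which keeps the negative gradient equal to the plain residuals; real libraries are far more sophisticated, so treat this only as an illustration of the loop above.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, n_trees=100, learning_rate=0.1, max_depth=3):
    """Toy gradient boosting for squared-error regression (illustrative only)."""
    # 1. Start from a constant prediction: the mean of the targets
    init = float(np.mean(y))
    pred = np.full(len(y), init)
    trees = []
    for _ in range(n_trees):
        # 2. Residuals = negative gradient of the squared-error loss
        residuals = y - pred
        # 3. Fit a small tree to the residuals
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)
        # 4. Add its scaled predictions to the ensemble and repeat
        pred = pred + learning_rate * tree.predict(X)
        trees.append(tree)
    return init, trees

def gradient_boost_predict(X, init, trees, learning_rate=0.1):
    pred = np.full(X.shape[0], init)
    for tree in trees:
        pred = pred + learning_rate * tree.predict(X)
    return pred
```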
Popular implementations include:

- XGBoost
- LightGBM
- CatBoost
- scikit-learn's GradientBoostingClassifier and HistGradientBoostingClassifier
The table below compares the two methods; example code for each follows.
| Aspect | Random Forests | Gradient Boosting |
|---|---|---|
| Training Speed | Faster (parallel training) | Slower (sequential training) |
| Prediction Speed | Slower | Faster |
| Overfitting | Less prone | More prone |
| Hyperparameter Tuning | Less sensitive | More sensitive |
| Noise Handling | Better | Worse |
| Feature Importance | More reliable | Less reliable |
| Memory Usage | Higher | Lower |
| Best Use Cases | General purpose, noisy data | Structured data, competitions |
First, a Random Forest with scikit-learn:

```python
# Import necessary libraries
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
# Prepare your data
X_train, X_test, y_train, y_test = train_test_split(
X, y,
test_size=0.2,
random_state=42
)
# Initialize and train the model
rf_model = RandomForestClassifier(
n_estimators=100, # Number of trees
max_depth=10, # Maximum depth of trees
min_samples_split=2, # Minimum samples required to split
random_state=42
)
rf_model.fit(X_train, y_train)
# Make predictions
predictions = rf_model.predict(X_test)
# Get feature importance
feature_importance = pd.DataFrame({
'feature': X_train.columns,
'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)
# Print feature importance
print("Feature Importance:")
print(feature_importance)
```
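One way to check how well the forest generalizes (assuming a classification problem, as above) is to score it on the held-out test set, for example:

```python
from sklearn.metrics import accuracy_score, classification_report

# Score the forest on the held-out test set
print("Accuracy:", accuracy_score(y_test, predictions))
print(classification_report(y_test, predictions))
```

Random forests also come with a built-in estimate: passing `oob_score=True` to `RandomForestClassifier` reuses the out-of-bag samples left out by the bootstrap, and the result is available afterwards as `rf_model.oob_score_`.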
The same workflow with gradient boosting, using XGBoost:

```python
# Import necessary libraries
import xgboost as xgb
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
# Prepare your data
X_train, X_test, y_train, y_test = train_test_split(
X, y,
test_size=0.2,
random_state=42
)
# Initialize and train the model
xgb_model = xgb.XGBClassifier(
    n_estimators=100,          # Number of boosting rounds
    max_depth=6,               # Maximum depth of trees
    learning_rate=0.1,         # Step size shrinkage
    subsample=0.8,             # Subsample ratio of training instances
    colsample_bytree=0.8,      # Subsample ratio of columns when constructing each tree
    early_stopping_rounds=10,  # Recent XGBoost versions expect this here, not in fit()
    random_state=42
)
# Train with early stopping on a held-out evaluation set
xgb_model.fit(
    X_train, y_train,
    eval_set=[(X_test, y_test)],
    verbose=False
)
# Make predictions
predictions = xgb_model.predict(X_test)
# Get feature importance
feature_importance = pd.DataFrame({
'feature': X_train.columns,
'importance': xgb_model.feature_importances_
}).sort_values('importance', ascending=False)
# Print feature importance
print("Feature Importance:")
print(feature_importance)
# Optional: Plot feature importance
import matplotlib.pyplot as plt
xgb.plot_importance(xgb_model)
plt.show()
```
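Similarly, a quick sanity check of the boosted model on the held-out test set could look like this:

```python
from sklearn.metrics import accuracy_score, log_loss

# Hard-label accuracy on the test set
print("Accuracy:", accuracy_score(y_test, predictions))

# Probabilities are often more useful than labels, e.g. for ROC curves
proba = xgb_model.predict_proba(X_test)
print("Log loss:", log_loss(y_test, proba))
```

Note that the test set above was also used for early stopping; for a strictly unbiased final evaluation, you would typically split off a separate validation set for that purpose.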