TF-DF: The Fastest and Most Robust Model for Beginners

A gentle introduction to TensorFlow Decision Forests

Juan
4 min read · Aug 9, 2023

This article explains why data scientists should consider TensorFlow Decision Forests as a first choice when building models on tabular data.

What is TensorFlow Decision Forests (TF-DF)?

Released in 2021, TF-DF, short for TensorFlow Decision Forests, is a library that enables training, execution, and interpretation of various decision forest models such as Random Forest or Gradient Boosted Trees.

Photo by Thomas Griesbeck on Unsplash

🌳 Tree models

  • Decision Trees
  • Random Forest
  • Gradient Boosted Trees

For more information, check the original TensorFlow documentation about Decision Forests.

Some cool things about TF-DF:

  • TF-DF requires minimal pre-processing. In fact, TF-DF supports missing values in numerical and categorical features, although not in boolean ones.
  • TF-DF models often outperform other approaches on tabular data, or at least provide a strong baseline, and they help you understand the data.

This makes Decision Trees (and tree models in general) a good place to start when working with tabular data.

Using TF-DF in a Kaggle Competition

To test the library, I used it in the Spaceship Titanic Kaggle competition and reached an accuracy of 80%.

Data Pre-processing

The amount of pre-processing was minimal:

  • Converted the boolean columns to integers (1/0).
  • Generated a few new features that are more correlated with the target and dropped the columns that carry no useful signal.
Summary of the initial preprocessing applied
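
The two steps above can be sketched with pandas. This is only an illustration, not the exact code used for the competition: the tiny DataFrame is made up, and the choice of `TotalSpend` as a new feature and `Name` as a dropped column are assumptions. The column names themselves follow the Spaceship Titanic dataset.

```python
import pandas as pd

# Made-up rows shaped like the Spaceship Titanic data (illustrative only).
df = pd.DataFrame({
    "CryoSleep": [True, False, False],
    "VIP": [False, False, True],
    "RoomService": [0.0, 120.0, None],  # missing numerical values can stay as they are
    "Spa": [0.0, 30.0, 15.0],
    "Name": ["A B", "C D", "E F"],
})


def preprocess(data):
    """Apply the minimal pre-processing described above (assumed details)."""
    out = data.copy()
    # Boolean columns become 1/0 integers.
    for col in ["CryoSleep", "VIP"]:
        out[col] = out[col].astype(int)
    # Hypothetical engineered feature: total amount spent (missing values count as 0).
    out["TotalSpend"] = out[["RoomService", "Spa"]].sum(axis=1)
    # Hypothetical "useless" column for the model.
    return out.drop(columns=["Name"])


processed = preprocess(df)
print(processed)
```

Note that the missing `RoomService` value is left untouched, since TF-DF can handle missing numerical values on its own.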

As a final step, the datasets used for training and evaluation must be converted into TensorFlow datasets.

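import tensorflow_decision_forests as tfdf

# x_train and x_val are pandas DataFrames that still contain the label column;
# target_name is the name of that column.
df_tf_train = tfdf.keras.pd_dataframe_to_tf_dataset(x_train, label=target_name)
df_tf_val = tfdf.keras.pd_dataframe_to_tf_dataset(x_val, label=target_name)
# The test set has no label, so none is passed.
df_tf_test = tfdf.keras.pd_dataframe_to_tf_dataset(df_test)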
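
The calls above assume that `x_train` and `x_val` already exist. The article does not show how they were produced, so the following is only one possible way to get them: a random hold-out split with pandas. The helper name `split_train_val`, the 80/20 ratio, and the toy data are all assumptions.

```python
import pandas as pd


def split_train_val(data, val_fraction=0.2, seed=42):
    """Randomly hold out a fraction of the rows for validation (hypothetical helper)."""
    val = data.sample(frac=val_fraction, random_state=seed)
    train = data.drop(val.index)
    return train, val


# Toy data: one feature plus the label column.
toy = pd.DataFrame({"Age": range(10), "Transported": [0, 1] * 5})
x_train, x_val = split_train_val(toy)
print(len(x_train), len(x_val))
```

Both frames keep the label column, which is what `pd_dataframe_to_tf_dataset` expects when `label` is given.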

Model Selection and Evaluation

I used two ensemble learning methods: Random Forest and Gradient Boosted Trees.

Random Forest

Random Forest is a collection of decision trees, each trained independently and without pruning on a random subset of the training dataset (sampled with replacement).

The algorithm has several practical advantages:

  • It is robust to overfitting and easy to use.
  • It can provide a ranking of the most important features.
  • It can be trained with the default hyperparameters or configured with custom ones.
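
To make the idea of "independent trees trained on random samples drawn with replacement" concrete, here is a toy sketch in plain Python. It is not how TF-DF implements Random Forest: each "tree" is a single split on one numeric feature, there is no random feature selection, and all function names are made up for this illustration.

```python
import random
from collections import Counter


def majority(labels):
    """Most common label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement."""
    return [rng.choice(rows) for _ in rows]


def fit_stump(rows):
    """Fit a one-split 'tree' on (x, label) rows: (threshold, left_label, right_label)."""
    fallback = majority([y for _, y in rows])
    best = (None, fallback, fallback)
    best_errors = sum(y != fallback for _, y in rows)
    for t in sorted({x for x, _ in rows}):
        left = [y for x, y in rows if x <= t]
        right = [y for x, y in rows if x > t]
        if not right:
            continue
        left_label, right_label = majority(left), majority(right)
        errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
        if errors < best_errors:
            best, best_errors = (t, left_label, right_label), errors
    return best


def predict_stump(stump, x):
    t, left_label, right_label = stump
    if t is None:  # the sample contained no useful split
        return left_label
    return left_label if x <= t else right_label


def fit_forest(rows, n_trees=25, seed=0):
    """Train each stump independently on its own bootstrap sample."""
    rng = random.Random(seed)
    return [fit_stump(bootstrap_sample(rows, rng)) for _ in range(n_trees)]


def predict_forest(forest, x):
    """The forest predicts by majority vote over its trees."""
    return majority([predict_stump(stump, x) for stump in forest])


rows = [(1, 0), (2, 0), (3, 0), (4, 0), (6, 1), (7, 1), (8, 1), (9, 1)]
forest = fit_forest(rows)
print(predict_forest(forest, 2), predict_forest(forest, 8))
```

A single stump can be thrown off by an unlucky sample, but the vote across many stumps is much more stable, which is the intuition behind the robustness mentioned above.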

Gradient Boosted Trees

A GBT (Gradient Boosted Trees) model is a set of shallow decision trees trained sequentially. Each tree is trained to predict, and then correct, the errors of the previously trained trees (more precisely, each tree predicts the gradient of the loss with respect to the model output).

Some advantages of this model are:

  1. Sequential improvement: as mentioned above, every new tree corrects the trees before it.
  2. Focus on hard cases: each new tree is fitted to the errors (gradients) left by the previous trees, so examples that are still predicted poorly have the most influence on the next tree.
  3. Stronger predictions: the iterative process often results in a stronger predictive model, especially when fine-tuned, which makes GBT a good fit for tasks where accuracy is crucial.
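
The sequential idea can be sketched in a few lines of plain Python. This is a toy illustration under simplifying assumptions, not TF-DF's implementation: it uses squared-error loss (where the negative gradient is simply the residual), one numeric feature, one-split trees, and made-up data and function names.

```python
def fit_stump(xs, residuals):
    """Least-squares one-split tree: returns (threshold, left_value, right_value)."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not right:
            continue
        lv = sum(left) / len(left)
        rv = sum(right) / len(right)
        err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lv, rv)
    return best[1:]


def predict(base, stumps, lr, x):
    """Initial guess plus the scaled corrections of every tree trained so far."""
    total = base
    for t, lv, rv in stumps:
        total += lr * (lv if x <= t else rv)
    return total


def fit_gbt(xs, ys, n_trees=20, lr=0.3):
    """Train the trees one after another, each on the current residuals."""
    base = sum(ys) / len(ys)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - predict(base, stumps, lr, x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return base, stumps


def mse(base, stumps, lr, xs, ys):
    return sum((y - predict(base, stumps, lr, x)) ** 2 for x, y in zip(xs, ys)) / len(ys)


xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
base, stumps = fit_gbt(xs, ys)
for k in (0, 1, 20):
    print(k, "trees, training MSE:", mse(base, stumps[:k], 0.3, xs, ys))
```

The training error shrinks as trees are added, because every tree is fitted to whatever the current ensemble still gets wrong.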
import tensorflow_decision_forests as tfdf


def generate_tfdf_model(model_name, hyperparam=None):
    """
    Function to generate the model.

    model_name: 'random_forest', 'gradient_boosted_trees' or 'cart'.
    hyperparam: optional name of a hyperparameter template; None keeps the defaults.
    """
    model_classes = {
        'random_forest': tfdf.keras.RandomForestModel,
        'gradient_boosted_trees': tfdf.keras.GradientBoostedTreesModel,
        'cart': tfdf.keras.CartModel,
    }
    # Fail early instead of reaching the return with `model` undefined.
    if model_name not in model_classes:
        raise ValueError(f'Unknown model name: {model_name}')

    if hyperparam is None:
        # Models without hyperparameters (library defaults)
        model = model_classes[model_name]()
    else:
        # Models with hyperparameters
        model = model_classes[model_name](
            hyperparameter_template=hyperparam,
            task=tfdf.keras.Task.CLASSIFICATION,
        )

    print('Model', model_name, 'created')
    return model


# Generate and train the model
# ------------------------------
# model_name is one of 'random_forest', 'gradient_boosted_trees' or 'cart'
model = generate_tfdf_model(model_name, hyperparam=None)
model.compile(metrics=["accuracy"])
model.fit(df_tf_train)

# Evaluate on the validation set (returns the loss and the compiled metrics)
print(model.evaluate(df_tf_val, return_dict=True))

Variable importance

One interesting thing about decision tree models is that you can inspect the Variable Importances (VIs), which describe the impact of each feature on the model.

  • VIs generally indicate how much a variable contributes to the model's predictions or quality. Different VIs have different semantics and are generally not comparable with each other.
  • The VIs returned by variable_importances() depend on the learning algorithm and its hyper-parameters.
importance = model.make_inspector().variable_importances()
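
The result is a dictionary that maps each importance name to a list of (feature, score) pairs. A small helper like the one below can turn it into a readable ranking. This is a sketch: `top_features` is a made-up name, the `Feature` stand-in and the scores are fabricated so the snippet runs without a trained model, and sorting by descending score assumes that a higher score means a more important feature for the chosen VI (which is the case for "NUM_AS_ROOT").

```python
from collections import namedtuple

# Stand-in for the column objects TF-DF returns; only `.name` is used here.
Feature = namedtuple("Feature", ["name"])


def top_features(importances, vi_name, k=3):
    """Return the k highest-scoring (feature name, score) pairs for one importance."""
    ranked = sorted(importances[vi_name], key=lambda pair: pair[1], reverse=True)
    return [(feature.name, score) for feature, score in ranked[:k]]


# Fabricated values with the same shape as the output of variable_importances().
fake_importances = {
    "NUM_AS_ROOT": [
        (Feature("CryoSleep"), 120.0),
        (Feature("Age"), 10.0),
        (Feature("Spa"), 45.0),
    ],
}

print(top_features(fake_importances, "NUM_AS_ROOT", k=2))
```

With a trained model, the same helper could be called as `top_features(importance, "NUM_AS_ROOT")`, provided that importance is available for the model in question.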

Conclusion

TF-DF makes it easy to train Random Forest and Gradient Boosted Trees models: they require minimal pre-processing and perform very well on tabular data.

If you want a quick first attempt at a prediction task, you can train a model with just a few lines of code and the default hyper-parameters.

And last but not least, it can shed some light on which features the model considers most important when making its predictions.


Juan

🎯 Senior Data Scientist at Bravo Studio | 🎮 Ex-FRVR Game Data Scientist | 🤖 MSc in AI & Computer Science