Deep Dive into Tree-Based Models

Aruna Singh · Published in DataDrivenInvestor · Mar 3, 2021 · 10 min read


Predictive modeling is widely used in finance, investing, and other disciplines that attempt to determine the strength and character of the relationship between one dependent variable (usually denoted by Y) and a series of other variables (known as independent variables). Tree-based models make such predictions using a series of if-then rules derived from one or more decision trees. All tree-based models can be used for either regression (predicting numerical values) or classification (predicting categorical values). Hence, we’ll explore five different types of tree-based models.

Supervised Learning and Unsupervised Learning

Supervised learning is the subfield of machine learning in which you train a model using input data and corresponding labels. In supervised learning, each example is a pair consisting of the input data and an output value which represents a category or label in the case of classification, or a numeric value in the case of regression. The converse is called unsupervised learning, where you learn from the input data alone.
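As a toy illustration (the data below is made up), a supervised dataset pairs each row of inputs with a label, while an unsupervised task would use the same rows without the label column:

# Supervised: inputs (income, age) paired with a label (default)
labeled <- data.frame(income  = c(40, 85, 23),
                      age     = c(25, 52, 31),
                      default = factor(c("yes", "no", "yes")))
# Unsupervised: the same inputs with no label column
unlabeled <- labeled[, c("income", "age")]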

Tree Based Models

Decision tree-based models include single trees as well as tree-based ensembles such as Random Forests and Gradient Boosting Machines (GBMs). Tree-based models stand out from other types of machine learning models due to their unique combination of interpretability, ease of use, and, when used in ensembles, excellent accuracy. They are used to make decisions, explore data, and make predictions. In this article, you will explore different use cases, build and evaluate classification and regression models, and tune model parameters for optimal performance.

Introduction to Classification Tree

A classification tree is a decision tree that performs a classification (rather than regression) task. It is a hierarchical structure with nodes and directed edges. The node at the top is called the root node. The nodes at the bottom are called the leaf nodes or terminal nodes. Nodes that are neither the root node nor leaf nodes are called internal nodes. The root and internal nodes have binary test conditions associated with them, and each leaf node has an associated class label.

“rpart” is short for recursive partitioning, which is the process used for training both classification and regression trees.

Let’s train a decision tree model to understand which loan applications are at higher risk of default using a subset of the German Credit Dataset. The response variable, called “default”, indicates whether the loan went into a default or not, which means this is a binary classification problem. Both categorical and continuous predictors are used for binary classification.

# Load the required packages
library(rpart)
library(rpart.plot)
# Look at the data
str(creditsub)
# Train the classification tree model
credit_model <- rpart(formula = default ~ .,
                      data = creditsub,
                      method = "class")
# Plot the decision tree
rpart.plot(x = credit_model, yesno = 2, type = 0, extra = 0)
  • In R, formulas are used to model the response as a function of a set of predictors. Here the formula is default ~ ., which means: use all columns (except the response column) as predictors.
  • In the rpart() function, note that you also have to provide the training data frame.
  • Using the model object that you create, plot the decision tree with the rpart.plot() function from the rpart.plot package.
Output: structure of the credit dataset
Output: rpart.plot decision tree

This tree predicts whether a loan applicant will default on their loan. Following a path from the root down to a leaf, ending in a “Yes” leaf means the model predicts that this individual will default on their loan, whereas a “No” leaf means that they will not.

Assume we have a loan applicant who:

  • is applying for a 20-month loan
  • is requesting a loan amount that is 2% of their income
  • is 25 years old

Following the tree’s splits for this applicant, the answer would be “Yes”: the model predicts the person will default on their loan.
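To check this programmatically rather than by tracing the plot, you could pass the applicant to predict(). A minimal sketch, assuming the German Credit column names months_loan_duration, percent_of_income, and age, and that the fitted tree splits only on variables present here:

# Hypothetical applicant; column names are assumed, not taken from the article
applicant <- data.frame(months_loan_duration = 20,
                        percent_of_income    = 2,
                        age                  = 25)
predict(credit_model, newdata = applicant, type = "class")  # expected: "yes"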

Let’s move on to the modeling. We’ll continue working with the credit dataset to create a classification tree to make decisions.

First, we will split the data into two pieces. The first part, called the training set, is used for building the model, and the second part, the test set, is used to test the results. One common split uses 70% of the data for the training set and 30% for the test set. Of course, a lot of variation is possible depending on which rows land in which set. One way to reduce this variation is to use cross-validation.

# Total number of rows in the credit data frame
n <- nrow(credit)
# Number of rows for the training set (70% of the dataset)
n_train <- round(0.7 * n)
# Create a vector of indices which is a 70% random sample
set.seed(123)  # for reproducibility
train_indices <- sample(1:n, n_train)
# Subset the credit data frame to training indices only
credit_train <- credit[train_indices, ]
# Exclude the training indices to create the test set
credit_test <- credit[-train_indices, ]
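A quick sanity check that the two subsets together partition the original data:

# Every row should land in exactly one of the two sets
nrow(credit_train) + nrow(credit_test) == nrow(credit)  # TRUE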

Train a classification tree model

# Train the model (to predict 'default')
credit_model <- rpart(formula = default ~ .,
                      data = credit_train,
                      method = "class")
# Look at the model output
print(credit_model)

At this point, you have a classification tree model ready. The next steps would be to make a prediction and evaluate your predictions.

Output: Credit Model

To begin with, the rpart package has a predict() function whose first argument is the trained model and whose second argument is the test dataset. It also accepts a type argument, which controls whether the function returns predicted labels or raw predicted values. In classification problems, the model generates a raw predicted value for each class. When type equals “class”, the function returns a predicted class label; when type equals “prob”, a matrix of raw predicted values is returned instead, with one column for each class.
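For example, both output types on the test set (the class labels “no” and “yes” are assumed from the default column):

# Predicted class labels, one per test-set row
class_preds <- predict(credit_model, newdata = credit_test, type = "class")
# Predicted probabilities, one column per class (e.g. "no" and "yes")
prob_preds <- predict(credit_model, newdata = credit_test, type = "prob")
head(prob_preds)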

There are many ways to evaluate classification performance. Accuracy, confusion matrix, log-loss, and Area under the ROC Curve (or AUC) are some of the most popular metrics for binary classification problems. For now, we will introduce the confusion matrix and accuracy.

Confusion Matrix: The columns of a confusion matrix correspond to the truth labels, and the rows represent the predictions. In the 2-class case, a single prediction has four possible outcomes. True positives and true negatives are correct classifications. A false positive is when the outcome is incorrectly predicted as positive when it is actually negative. A false negative is when the outcome is incorrectly predicted as negative when it is actually positive. The same concept can be extended to any number of classes.

Accuracy: It measures how often the classifier predicts the class correctly. It is defined as the ratio between the number of correct predictions and the total number of rows, or data points, in the data.

Choose Recall if the occurrence of false negatives is unacceptable or intolerable. Recall tells us, out of the cases that are actually yes, how often the model predicts yes.

Choose Precision if you want to be more confident in your positive predictions. Precision tells us, out of the cases the model predicts as yes, how often it is correct.

Choose Specificity if you want to cover all true negatives, meaning we do not want any false alarms (false positives).
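In terms of the four confusion-matrix counts, all of these metrics reduce to simple ratios. A minimal sketch with made-up counts:

# Hypothetical counts: true/false positives and negatives
tp <- 40; fp <- 10; fn <- 20; tn <- 30
accuracy    <- (tp + tn) / (tp + fp + fn + tn)  # correct / all predictions
precision   <- tp / (tp + fp)   # of predicted positives, fraction correct
recall      <- tp / (tp + fn)   # of actual positives, fraction found
specificity <- tn / (tn + fp)   # of actual negatives, fraction found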

To create a confusion matrix in R, we will use the confusionMatrix function from the caret package. The arguments you need to specify are: the data and the reference. The data argument is a vector of predicted class labels on a test set and the reference is a vector of the true class labels.

# Load the caret package for confusionMatrix()
library(caret)
# Generate predicted classes using the model object
class_prediction <- predict(object = credit_model,
                            newdata = credit_test,
                            type = "class")
# Calculate the confusion matrix for the test set
confusionMatrix(data = class_prediction,
                reference = credit_test$default)
Output: Confusion Matrix

We can conclude that:

  • An accuracy of 64% means that, of every 200 predictions, 128 are correct and 72 are incorrect.
  • A precision of 86% means that, of every 200 loans predicted as defaults, 172 actually are defaults and 28 are not.
  • A recall of 50% means that, of every 200 loans that actually default, 100 are correctly identified as defaults and 100 are missed by our model.
  • A specificity of 86% means that, of every 200 loans that do not default, 172 are correctly labeled and 28 are mislabeled as defaults.

Splitting criterion in trees

A classification tree uses a split condition to predict class labels based on one or more input variables. The classification process starts from the root node of the tree, and at each node the process checks whether the input value should continue recursively down the right or left sub-branch according to the split condition. The process stops when it reaches a leaf (terminal) node.

The idea behind classification trees is to split the data into subsets where each subset belongs to only one class. This is accomplished by dividing the input space into “pure” regions, that is, regions with samples from only one class. With real data, completely pure regions may not be possible, so the decision tree will do the best it can to create regions that are as pure as possible. Boundaries separating these regions are called decision boundaries, and the decision tree model makes classification decisions based on these decision boundaries. The goal is to partition data at a node into subsets that are as pure as possible.

For example, given two candidate partitions of the same data, the one that results in more homogeneous subsets, where each subset contains mostly samples belonging to a single class, produces purer subsets and is the preferred partition.

Therefore, we need a way to measure the purity of a split in order to compare different ways to partition a set of data. It turns out that it works out better mathematically if we measure the impurity rather than the purity of a split. So the impurity measure of a node specifies how mixed the resulting subsets are.

Gini Index: A common impurity measure used for determining the best split is the Gini Index. The lower the Gini Index, the higher the purity of the split. So, the decision tree will select the split that minimizes the Gini Index.
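As a quick illustration (this helper is not part of rpart; it simply computes one minus the sum of squared class proportions):

# Gini impurity of a vector of class labels: 1 - sum(p_k^2)
gini_impurity <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}
gini_impurity(c("yes", "yes", "yes", "yes"))  # 0.0, a pure node
gini_impurity(c("yes", "yes", "no", "no"))    # 0.5, maximally mixed for two classes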

Besides the Gini Index, other impurity measures include entropy (the basis of information gain) and the misclassification rate, which we will discuss in the next article.

Compare models with a different splitting criterion

Train two models that use different splitting criteria and use the validation set to choose the “best” model from this group. To do this, you’ll use the parms argument of the rpart() function. This argument takes a named list of parameters that change how the model is trained; set its split element to control the splitting criterion.

# Train a gini-based model, splitting the tree based on the Gini index
credit_model1 <- rpart(formula = default ~ .,
                       data = credit_train,
                       method = "class",
                       parms = list(split = "gini"))
# Train an information-based model, splitting the tree based on information gain
credit_model2 <- rpart(formula = default ~ .,
                       data = credit_train,
                       method = "class",
                       parms = list(split = "information"))
# Generate predictions on the validation set using the gini model
pred1 <- predict(object = credit_model1,
                 newdata = credit_test,
                 type = "class")
# Generate predictions on the validation set using the information model
pred2 <- predict(object = credit_model2,
                 newdata = credit_test,
                 type = "class")
# Compare classification error using ce() from the Metrics package
library(Metrics)
ce(actual = credit_test$default, predicted = pred1)
ce(actual = credit_test$default, predicted = pred2)

Classification error is the fraction of incorrectly classified instances. Compute and compare the test set classification error of the two models using the ce() function from the Metrics package.
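Note that ce() simply computes the mean misclassification rate, so an equivalent base-R check would be:

# Equivalent to ce(actual = credit_test$default, predicted = pred1)
mean(credit_test$default != pred1)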

Output: comparison between the Gini-based and information-based models

After comparing the results, we conclude that the information-based model performs better than the Gini-based model, since the classification error is higher for the Gini-based model.

Hence, we have covered an overview of classification trees with R and, in the process, learned some real insights from the German Credit dataset. We have also learned about accuracy and the confusion matrix to better understand model performance.

Thanks for Reading and Stay Connected!


