Kaneki Logo

Ilyass Ezzam

(KanekiEzz)AI Engineer & Full Stack Developer

← Back to posts
MachineLearningPythonClassificationEnsembleLearning

Random Forests From Scratch

3 May 2026·8 min read

Introduction

Random forests are known as ensemble learning methods used for classification and regression. Random forests are essentially a collection of decision trees that are each fit on a subsample of the data. While an individual tree is typically noisy and subject to high variance, random forests average many different trees, which in turn reduces the variability and leave us with a powerful classifier.

Random forests are also non-parametric and require little to no parameter tuning. They differ from many common machine learning models used today that are typically optimized using gradient descent. Models like linear regression, support vector machines, neural networks, etc. require a lot of matrix based operations, while tree based models like random forest are constructed with basic arithmetic.

Decision Trees

A quick overview if you're not familiar with binary decision trees. We start from the very top, which we'll call the root node, and ask a simple question. If the answer to the question is correct we'll move to the left node directly connected below, called the left child, otherwise, if the answer is wrong we'll move down to the right node, called the right child. We'll repeat this process until we reach one of the bottom nodes, also known as the terminal nodes.

For classification the terminal nodes output the class that is the mode while in the context of regression they'll output the mean prediction.

The Problem with Single Trees

The problem with relying on a single tree is that it needs a lot of depth to gain strong predictive power. Binary decision trees can have up to size 2d+1−1, where d is the depth of the tree. For example, a tree with a depth of 10 can ask up to 2047 different questions. This ultimately leads to a lot of complexity within our tree and in the world of machine learning high complexity leads to high variance.

Decision trees have what's called low bias and high variance. This just means that our model is inconsistent, but accurate on average. Imagine a dart board filled with darts all over the place missing left and right, however, if we were to average them into just 1 dart we could have a bullseye. Each individual tree can be thought of as the inaccurate darts and a random forest would give us that bullseye.

Entropy & Information Gain

Entropy

The most commonly used measurements for constructing binary decision trees are: Entropy, Classification Error, and Gini index. In this article we'll be focusing on entropy, which is a measurement of uncertainty (sometimes called impurity) that uses the following formula:

H(X) = −∑j pjlog pj

such that pj is the probability of class j. For binary classification entropy takes on the form:

H(X) = −p log₂ p − (1−p) log₂(1−p)

where p denotes P(X=1) (the probability that a passenger survived).

Entropy Properties:

  • Certainty: Entropy is minimized when all samples in a node belong to the same class such that P(X=1)=1. Result: −1log₂(1)−0log₂(0)=0
  • Uncertainty: Entropy is maximized when we have a uniform class distribution such that P(X=1) = 0.5. Result: −0.5log₂(0.5)−0.5log₂(0.5)=1

Information Gain

When we compare the entropy from before and after a split we get what's called information gain. Information gain measures how much information we gained when splitting a node at a particular value. Information gain IG is computed with the following formula:

IG(D) = I(Dp) − (Nleft/Np)I(Dleft) − (Nright/Np)I(Dright)

This can be interpreted as: information gain = entropybefore − entropyafter

We'll always want to split a node with the value that maximizes information gain.

Bootstrapping & Bagging

Bootstrapping

One of the main reasons random forests are so powerful is due to the randomness injected into each tree. Each individual decision tree will be constructed on a bootstrapped subset of our data. If our dataset has n observations bootstrapping is the process of sampling n points with replacement.

This means that some observations in our dataset will be selected more than once and some won't be selected at all. We can calculate that the probability an observation is omitted from our bootstrapped dataset is (1−1/n)n. Since e−1 ≈ 0.368 ≈ 1/3, bootstrapping n samples with replacement will leave out approximately 1/3 of the observations in each distinct tree.

Since each individual tree is built using only 2/3 of the data we'll find that most trees will differ significantly from one another.

Out-of-Bag Error Estimate

The OOB (out-of-bag) samples are the ≈1/3 observations that were not selected to build a particular tree. Once we've built our tree with the n bootstrapped observations we can test each sample that was left out and compute the mean prediction error from those samples. This is essentially leave-one-out cross validation.

Bagging

Bagging is the process of growing a tree where each node in the tree looks at every value in our bootstrapped sample in every feature to find the best split in the data at that particular node.

Random forests follow the same procedure as bagging, however, the key difference is that on a dataset with p features each tree will only look at a subset of m features where m = √p. This injects even more randomness into the model. With this, we'll be able to produce many uncorrelated trees which will help us capture a lot of the variability as well as interactions between multiple variables.

The Random Forest Algorithm

Suppose we have the following data {(x⃗₁,y₁),(x⃗₂,y₂),...,(x⃗ₙ,yₙ)} where each x⃗ᵢ represents a feature vector and let B be the number of trees we want to construct in our forest. We will do the following:

for b=1 to B:
  Draw a bootstrap sample of size n from the data
  Grow a decision tree Tᵦ from our bootstrapped sample, by repeating:
    Sample m = √p features (where p is the number of features)
    Compute the information gain for each possible value
    Split the node into 2 children nodes
  Output the ensemble of trees {T}¹ᴮ

Implementation in Python

Entropy & Information Gain Functions

import random
import pandas as pd
import numpy as np

def entropy(p):
    if p == 0:
        return 0
    elif p == 1:
        return 0
    else:
        return - (p * np.log2(p) + (1 - p) * np.log2(1-p))

def information_gain(left_child, right_child):
    parent = left_child + right_child
    p_parent = parent.count(1) / len(parent) if len(parent) > 0 else 0
    p_left = left_child.count(1) / len(left_child) if len(left_child) > 0 else 0
    p_right = right_child.count(1) / len(right_child) if len(right_child) > 0 else 0
    IG_p = entropy(p_parent)
    IG_l = entropy(p_left)
    IG_r = entropy(p_right)
    return IG_p - len(left_child) / len(parent) * IG_l - len(right_child) / len(parent) * IG_r

Bootstrap Function

def draw_bootstrap(X_train, y_train):
    bootstrap_indices = list(np.random.choice(range(len(X_train)), len(X_train), replace = True))
    oob_indices = [i for i in range(len(X_train)) if i not in bootstrap_indices]
    X_bootstrap = X_train.iloc[bootstrap_indices].values
    y_bootstrap = y_train[bootstrap_indices]
    X_oob = X_train.iloc[oob_indices].values
    y_oob = y_train[oob_indices]
    return X_bootstrap, y_bootstrap, X_oob, y_oob

def oob_score(tree, X_test, y_test):
    mis_label = 0
    for i in range(len(X_test)):
        pred = predict_tree(tree, X_test[i])
        if pred != y_test[i]:
            mis_label += 1
    return mis_label / len(X_test)

Finding the Best Split Point

def find_split_point(X_bootstrap, y_bootstrap, max_features):
    feature_ls = list()
    num_features = len(X_bootstrap[0])

    while len(feature_ls) <= max_features:
        feature_idx = random.sample(range(num_features), 1)
        if feature_idx not in feature_ls:
            feature_ls.extend(feature_idx)

    best_info_gain = -999
    node = None
    for feature_idx in feature_ls:
        for split_point in X_bootstrap[:,feature_idx]:
            left_child = {'X_bootstrap': [], 'y_bootstrap': []}
            right_child = {'X_bootstrap': [], 'y_bootstrap': []}

            # split children for continuous variables
            if type(split_point) in [int, float]:
                for i, value in enumerate(X_bootstrap[:,feature_idx]):
                    if value <= split_point:
                        left_child['X_bootstrap'].append(X_bootstrap[i])
                        left_child['y_bootstrap'].append(y_bootstrap[i])
                    else:
                        right_child['X_bootstrap'].append(X_bootstrap[i])
                        right_child['y_bootstrap'].append(y_bootstrap[i])
            # split children for categoric variables
            else:
                for i, value in enumerate(X_bootstrap[:,feature_idx]):
                    if value == split_point:
                        left_child['X_bootstrap'].append(X_bootstrap[i])
                        left_child['y_bootstrap'].append(y_bootstrap[i])
                    else:
                        right_child['X_bootstrap'].append(X_bootstrap[i])
                        right_child['y_bootstrap'].append(y_bootstrap[i])

            split_info_gain = information_gain(left_child['y_bootstrap'], right_child['y_bootstrap'])
            if split_info_gain > best_info_gain:
                best_info_gain = split_info_gain
                left_child['X_bootstrap'] = np.array(left_child['X_bootstrap'])
                right_child['X_bootstrap'] = np.array(right_child['X_bootstrap'])
                node = {'information_gain': split_info_gain,
                        'left_child': left_child,
                        'right_child': right_child,
                        'split_point': split_point,
                        'feature_idx': feature_idx}

    return node

Node Splitting & Tree Building

def terminal_node(node):
    y_bootstrap = node['y_bootstrap']
    pred = max(y_bootstrap, key = y_bootstrap.count)
    return pred

def split_node(node, max_features, min_samples_split, max_depth, depth):
    left_child = node['left_child']
    right_child = node['right_child']

    del(node['left_child'])
    del(node['right_child'])

    if len(left_child['y_bootstrap']) == 0 or len(right_child['y_bootstrap']) == 0:
        empty_child = {'y_bootstrap': left_child['y_bootstrap'] + right_child['y_bootstrap']}
        node['left_split'] = terminal_node(empty_child)
        node['right_split'] = terminal_node(empty_child)
        return

    if depth >= max_depth:
        node['left_split'] = terminal_node(left_child)
        node['right_split'] = terminal_node(right_child)
        return node

    if len(left_child['X_bootstrap']) <= min_samples_split:
        node['left_split'] = terminal_node(left_child)
    else:
        node['left_split'] = find_split_point(left_child['X_bootstrap'], left_child['y_bootstrap'], max_features)
        split_node(node['left_split'], max_features, min_samples_split, max_depth, depth + 1)
    
    if len(right_child['X_bootstrap']) <= min_samples_split:
        node['right_split'] = terminal_node(right_child)
    else:
        node['right_split'] = find_split_point(right_child['X_bootstrap'], right_child['y_bootstrap'], max_features)
        split_node(node['right_split'], max_features, min_samples_split, max_depth, depth + 1)

Building the Random Forest

def build_tree(X_bootstrap, y_bootstrap, max_depth, min_samples_split, max_features):
    root_node = find_split_point(X_bootstrap, y_bootstrap, max_features)
    split_node(root_node, max_features, min_samples_split, max_depth, 1)
    return root_node

def random_forest(X_train, y_train, n_estimators, max_features, max_depth, min_samples_split):
    tree_ls = list()
    oob_ls = list()
    for i in range(n_estimators):
        X_bootstrap, y_bootstrap, X_oob, y_oob = draw_bootstrap(X_train, y_train)
        tree = build_tree(X_bootstrap, y_bootstrap, max_depth, min_samples_split, max_features)
        tree_ls.append(tree)
        oob_error = oob_score(tree, X_oob, y_oob)
        oob_ls.append(oob_error)
    print("OOB estimate: {:.2f}".format(np.mean(oob_ls)))
    return tree_ls

Making Predictions

def predict_tree(tree, X_test):
    feature_idx = tree['feature_idx']

    if X_test[feature_idx] <= tree['split_point']:
        if type(tree['left_split']) == dict:
            return predict_tree(tree['left_split'], X_test)
        else:
            value = tree['left_split']
            return value
    else:
        if type(tree['right_split']) == dict:
            return predict_tree(tree['right_split'], X_test)
        else:
            return tree['right_split']

def predict_rf(tree_ls, X_test):
    pred_ls = list()
    for i in range(len(X_test)):
        ensemble_preds = [predict_tree(tree, X_test.values[i]) for tree in tree_ls]
        final_pred = max(ensemble_preds, key = ensemble_preds.count)
        pred_ls.append(final_pred)
    return np.array(pred_ls)

Training & Evaluation

n_estimators = 100
max_features = 3
max_depth = 10
min_samples_split = 2

model = random_forest(X_train, y_train, n_estimators=100, max_features=3, max_depth=10, min_samples_split=2)
# OOB estimate: 0.22

preds = predict_rf(model, X_test)
acc = sum(preds == y_test) / len(y_test)
print("Testing accuracy: {}".format(np.round(acc,3)))
# Testing accuracy: 0.768

Parameters Explained

  • n_estimators: (int) The number of trees in the forest.
  • max_features: (int) The number of features to consider when looking for the best split (typically √p).
  • max_depth: (int) The maximum depth of the tree.
  • min_samples_split: (int) The minimum number of samples required to split an internal node.

Conclusion

Random forests are great for an array of reasons that include easy implementation and requiring little to no parameter tuning. RF's may not be powerhouses like neural networks or gradient boosting models, but they should certainly be in everyone's machine learning repertoire.

There are methods that can allow you to see which variables are more important than others, but are worth looking into. Hopefully this can also help you get in the right direction if you are interested in understanding how gradient boosting models like XGBoost work.

← All postsBrowse tags →