Random Forests from Scratch: A Step-by-Step Guide

Random Forests are a powerful ensemble learning method used for classification and regression tasks. In this post, we'll build a Random Forest model from scratch in Python, exploring the key concepts and algorithms behind it.

Understanding Decision Trees

Before diving into Random Forests, it's essential to understand Decision Trees. A Decision Tree is a flowchart-like structure where each internal node represents a feature (or attribute), each branch represents a decision rule, and each leaf node represents an outcome (or class label).

Building a Decision Tree

To build a Decision Tree, we use algorithms like ID3, C4.5, or CART. The basic idea is to recursively split the dataset on the feature (and threshold) that yields the highest information gain or, for CART, the largest reduction in Gini impurity.

Random Forests: The Ensemble Method

A Random Forest is an ensemble of Decision Trees. It combines the predictions of multiple trees to improve accuracy and reduce overfitting. Each tree is trained on a bootstrap sample of the data (rows drawn with replacement) and considers only a random subset of the features when choosing a split.

Implementing Random Forests from Scratch

Here's a simplified implementation of Random Forests in Python:


import numpy as np
from collections import Counter

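class DecisionTree:
    """Minimal sketch of the omitted logic: exhaustive search for the lowest weighted Gini."""

    def __init__(self, max_depth=None):
        self.max_depth = max_depth
        self.tree = None

    def fit(self, X, y):
        self.tree = self._build_tree(np.asarray(X), np.asarray(y))

    @staticmethod
    def _gini(y):
        _, counts = np.unique(y, return_counts=True)
        probs = counts / counts.sum()
        return 1.0 - np.sum(probs ** 2)

    def _build_tree(self, X, y, depth=0):
        majority = Counter(y.tolist()).most_common(1)[0][0]
        # Stop on a pure node or when the depth limit is reached: return a leaf label
        if len(set(y.tolist())) == 1 or (self.max_depth is not None and depth >= self.max_depth):
            return majority
        best = None  # (weighted Gini, feature index, threshold)
        for feature in range(X.shape[1]):
            # Skip the largest value so both sides of the split are non-empty
            for threshold in np.unique(X[:, feature])[:-1]:
                left = X[:, feature] <= threshold
                score = (left.sum() * self._gini(y[left])
                         + (~left).sum() * self._gini(y[~left])) / len(y)
                if best is None or score < best[0]:
                    best = (score, feature, threshold)
        if best is None:  # every row has identical features, so no split is possible
            return majority
        _, feature, threshold = best
        left = X[:, feature] <= threshold
        # Internal nodes are (feature, threshold, left_subtree, right_subtree) tuples
        return (feature, threshold,
                self._build_tree(X[left], y[left], depth + 1),
                self._build_tree(X[~left], y[~left], depth + 1))

    def predict(self, X):
        return np.array([self._predict_one(x) for x in np.asarray(X)])

    def _predict_one(self, x):
        node = self.tree
        while isinstance(node, tuple):  # descend until a leaf label is reached
            feature, threshold, left, right = node
            node = left if x[feature] <= threshold else right
        return node
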

class RandomForest:
    def __init__(self, n_trees=10, max_depth=None):
        self.n_trees = n_trees
        self.max_depth = max_depth
        self.trees = []

    def fit(self, X, y):
        for _ in range(self.n_trees):
            # Bootstrap sampling
            indices = np.random.choice(len(X), len(X), replace=True)
            X_sample = X[indices]
            y_sample = y[indices]
            tree = DecisionTree(max_depth=self.max_depth)
            tree.fit(X_sample, y_sample)
            self.trees.append(tree)

    def predict(self, X):
        # Aggregate predictions from all trees
        tree_preds = np.array([tree.predict(X) for tree in self.trees])
        return [Counter(tree_preds[:, i]).most_common(1)[0][0] for i in range(len(X))]

Random Forest Classifier vs Random Forest Regressor

Although both algorithms are based on the same idea—combining multiple decision trees—they solve different types of problems.

Random Forest Classifier

A Random Forest Classifier is used when the target is a category (class label).

Imagine we have the following training data:

House Size (m²)	House Type
50	Apartment
60	Apartment
150	Villa
180	Villa

Now suppose we want to predict the type of a new house with a size of 160 m².

Each decision tree in the forest makes its own prediction:

Tree 1 → Villa
Tree 2 → Villa
Tree 3 → Apartment
Tree 4 → Villa
Tree 5 → Villa

We then count the votes:

Villa: 4 votes
Apartment: 1 vote

Since Villa receives the most votes, the final prediction is:

Prediction: Villa

This process is called Majority Voting.
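The vote count above can be reproduced in a few lines. This is only a minimal sketch: the `tree_votes` list hard-codes the five hypothetical tree outputs from the example rather than calling real trees.

```python
from collections import Counter

# Hypothetical outputs of the five trees for the 160 m² house (taken from the example)
tree_votes = ["Villa", "Villa", "Apartment", "Villa", "Villa"]

vote_counts = Counter(tree_votes)
# most_common(1) returns a one-element list holding the (label, count) pair with the most votes
prediction, n_votes = vote_counts.most_common(1)[0]

print(prediction, n_votes)  # Villa 4
```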

Random Forest Regressor

A Random Forest Regressor is used when the target is a continuous numerical value.

Instead of predicting the house type, suppose we want to predict its price.

Our training data looks like this:

House Size (m²)	Price (DH)
50	500,000
60	600,000
150	1,800,000
180	2,200,000

Now we want to estimate the price of a 160 m² house.

Each tree predicts a different price:

Tree 1 → 1,900,000 DH
Tree 2 → 2,000,000 DH
Tree 3 → 1,850,000 DH
Tree 4 → 1,950,000 DH
Tree 5 → 2,100,000 DH

Unlike classification, majority voting is not useful here: the predictions are continuous values, so the trees rarely agree on an exact number.

Instead, we compute the average:

(1,900,000 + 2,000,000 + 1,850,000 + 1,950,000 + 2,100,000) / 5 = 1,960,000

Therefore, the final prediction is:

Prediction: 1,960,000 DH
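The same averaging step, sketched with NumPy. The `tree_prices` array simply hard-codes the five hypothetical tree outputs from the example.

```python
import numpy as np

# Hypothetical per-tree price predictions (in DH) for the 160 m² house
tree_prices = np.array([1_900_000, 2_000_000, 1_850_000, 1_950_000, 2_100_000])

# A regression forest aggregates by taking the mean of the tree outputs
prediction = tree_prices.mean()

print(f"Prediction: {prediction:,.0f} DH")  # Prediction: 1,960,000 DH
```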

Key Difference

Random Forest Classifier	Random Forest Regressor
Predicts categories (classes)	Predicts continuous numerical values
Example: Apartment or Villa	Example: House price
Final prediction = Majority Vote	Final prediction = Average of all tree predictions

Conclusion

In this post, we explored the fundamentals of Decision Trees and how they form the basis of Random Forests. We also provided a simple implementation of Random Forests from scratch in Python. While this implementation is basic, it serves as a great starting point for understanding the mechanics behind ensemble learning methods.

Random Forest From Scratch

The real implementation — BaseModel, LabelEncoderSimple, Calculate, _Decision_Tree, RandomForest — with every formula worked by hand, an entropy-vs-Gini comparison, and a reusable project structure for any ML pipeline.

Features:

Decision trees
Bootstrap sampling
Feature bagging
Gini impurity
Prediction voting

Pipeline


data -> Bootstrapping -> tree -> voting -> predictions

Class map:


                RandomForest
                     │
        ┌────────────┼────────────┐
        │            │            │
        ▼            ▼            ▼
   DecisionTree   DecisionTree  DecisionTree
   (_Decision_Tree)(_Decision_Tree)(_Decision_Tree)
        │            │            │
        ▼            ▼            ▼
     Tree 1        Tree 2       Tree 3

We'll follow one running example all the way through, matching the shape of a typical test case:


X (age, salary-in-k) :  [1,20] [2,21] [3,22] [4,23] [5,24] [6,25] [7,26] [8,27]
y (class)             :    0     0     0      1      1      1      1      1

Three samples belong to class 0, five belong to class 1. Every formula below is computed on this exact data so the numbers are traceable end to end.

Part 1 — The Formulas (with worked comparisons)

1.1 Entropy — `H_parent`, `H_left`, `H_right`

Formula:


H(S) = - Σ p_i * log2(p_i)

Why we need it: entropy is a single number that tells us how "mixed" a node is. A pure node (one class only) has entropy 0; a 50/50 mix has maximum entropy. Trees use this number to decide whether a split actually helped.

Worked example (from the code's own docstring: y = [1, 2, 2, 4, 1]):

class	count	probability
1	2	2/5 = 0.4
2	2	2/5 = 0.4
4	1	1/5 = 0.2


H(y) = -(0.4*log2(0.4) + 0.4*log2(0.4) + 0.2*log2(0.2))
     = -(0.4*(-1.322) + 0.4*(-1.322) + 0.2*(-2.322))
     = 0.529 + 0.529 + 0.464
     = 1.522 bits
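As a quick check of the hand calculation, here is a small standalone sketch of the entropy formula. The function name `entropy` is chosen for illustration; it uses np.unique so it also works for labels that are not small integers.

```python
import numpy as np

def entropy(y):
    """Shannon entropy (in bits) of a 1-D collection of class labels."""
    _, counts = np.unique(y, return_counts=True)
    probs = counts / counts.sum()
    return float(-np.sum(probs * np.log2(probs)))

y = [1, 2, 2, 4, 1]
print(round(entropy(y), 3))  # 1.522
```

A pure node such as [7, 7, 7] gives 0 with the same function, matching the "pure node has entropy 0" statement above.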

Why three named methods (H_parent, H_left, H_right) instead of one generic _entropy? All three call the exact same formula underneath — the split exists so the code reads like the math. H_parent(y) next to H_left(y_left) and H_right(y_right) mirrors the whiteboard notation H(parent), H(left), H(right) directly, which makes debugging a specific split far easier than one anonymous call.

1.2 Gini Impurity

Formula:


Gini(S) = 1 - Σ p_i²

Why it exists as an alternative to entropy: entropy needs a logarithm per class, per node, per candidate threshold — for large datasets that adds up. Gini gives almost the same ranking of "which split is better" using only squaring and subtraction, which is why it's the faster default in many implementations (including scikit-learn's).

Same example, computed with Gini:


Gini(y) = 1 - (0.4² + 0.4² + 0.2²)
        = 1 - (0.16 + 0.16 + 0.04)
        = 1 - 0.36
        = 0.64
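The Gini calculation can be verified the same way. This is a standalone sketch; the function name `gini` here is illustrative and independent of the class methods shown later.

```python
import numpy as np

def gini(y):
    """Gini impurity of a 1-D collection of class labels."""
    _, counts = np.unique(y, return_counts=True)
    probs = counts / counts.sum()
    return float(1.0 - np.sum(probs ** 2))

print(round(gini([1, 2, 2, 4, 1]), 2))  # 0.64
```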

1.3 Comparison: Entropy vs. Gini

	Entropy	Gini
Formula	`-Σ p_i·log2(p_i)`	`1 - Σ p_i²`
Range (binary)	0 → 1	0 → 0.5
Computation	needs `log2`	only multiplication
Speed	slower	faster
Sensitivity to class mix	slightly more sensitive to rare classes	slightly smoother
Typical use	classic ID3/C4.5 trees	CART, scikit-learn default

Practical takeaway: they almost always pick the same split as "best" — the difference in outcome is usually tiny. This is exactly why the real implementation exposes criterion="entropy" or criterion="gini" as a constructor option instead of hard-coding one.
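To make the "Range (binary)" row of the table concrete, the sketch below evaluates both measures for a two-class node as the share of class 1 goes from 1% to 99%. The grid of proportions is an arbitrary choice for illustration.

```python
import numpy as np

# Proportion of class 1 in a binary node, from 0.01 to 0.99 in steps of 0.01
p = np.linspace(0.01, 0.99, 99)

entropy = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))
gini = 1 - (p ** 2 + (1 - p) ** 2)

# Both curves peak at the 50/50 mix; only the height of the peak differs
print("entropy peak:", round(float(entropy.max()), 3), "at p =", round(float(p[entropy.argmax()]), 2))
print("gini peak:   ", round(float(gini.max()), 3), "at p =", round(float(p[gini.argmax()]), 2))
```

Because both curves rise and fall together, they tend to rank candidate splits in the same order, which is why the choice of criterion rarely changes the tree much.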

1.4 Before vs. After Split — the Real Comparison

This is the core comparison that drives every decision the tree makes.

Before the split (entropy of the parent, using our running dataset — 3 samples of class 0, 5 of class 1):


H_parent = -(3/8 * log2(3/8) + 5/8 * log2(5/8))
         = -(0.375*(-1.415) + 0.625*(-0.678))
         = 0.531 + 0.424
         = 0.954 bits

After the split at age <= 3:

Left child → ages [1,2,3] → all class 0 → pure → H_left = 0
Right child → ages [4,5,6,7,8] → all class 1 → pure → H_right = 0

Weighted average ("H_after"):


H_after = (n_left/n) * H_left + (n_right/n) * H_right
        = (3/8)*0 + (5/8)*0
        = 0

Comparison:

	Entropy
Before split (`H_parent`)	0.954
After split (`H_after`)	0.000
Reduction	0.954

The split completely separated the two classes — impurity dropped from 0.954 to 0, the best possible outcome.

1.5 Information Gain — `IG(D)`

Formula:


IG(D) = H(parent) - H_after
      = H(parent) - [ (n_left/n)*H(left) + (n_right/n)*H(right) ]

Why it matters: entropy alone only tells you how mixed one node is. Information Gain tells you how much a specific candidate split improved things. The tree tries many (feature, threshold) pairs and keeps whichever gives the highest IG — that's the actual decision rule behind growing the tree.

Worked example (continuing from above):


IG(D) = 0.954 - 0.000 = 0.954   <- maximum possible gain for this data

The Gini equivalent (gini_gain) follows the identical pattern, just swapping gini(...) in for every _entropy(...) term.
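The numbers above can be reproduced with a short standalone sketch on the running dataset. The helper names mirror the formulas but are written here as plain functions for illustration; the second split (age <= 5) is an extra, deliberately worse candidate added for contrast.

```python
import numpy as np

def entropy(y):
    counts = np.bincount(y)
    probs = counts[counts > 0] / len(y)   # drop empty classes to avoid log2(0)
    return float(-np.sum(probs * np.log2(probs)))

def information_gain(y_parent, y_left, y_right):
    n = len(y_parent)
    h_after = (len(y_left) / n) * entropy(y_left) + (len(y_right) / n) * entropy(y_right)
    return entropy(y_parent) - h_after

age = np.array([1, 2, 3, 4, 5, 6, 7, 8])
y = np.array([0, 0, 0, 1, 1, 1, 1, 1])

perfect = age <= 3          # separates the classes completely
worse = age <= 5            # leaves two class-1 rows on the left side

ig_perfect = information_gain(y, y[perfect], y[~perfect])
ig_worse = information_gain(y, y[worse], y[~worse])

print(round(ig_perfect, 3))  # 0.954
print(round(ig_worse, 3))    # 0.348
```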

Part 2 — `BaseModel`: A Shared Foundation

The four main classes in this codebase inherit from BaseModel, which does two things:

Defines the common fit(X, y) / predict(X) interface (raising NotImplementedError if a subclass forgets to override them) — so every model in the project is guaranteed to expose the same two methods.
Provides a colored log() utility, gated behind a debug flag (read from the DEBUG environment variable), so training can be traced without littering the code with print() statements:


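import os

DEBUG = os.getenv("DEBUG", "0").lower() in ("1", "true", "yes")

class BaseModel:
    def __init__(self, debug=False):
        # Assumed constructor (not shown in the original excerpt): logging is on
        # if either the argument or the DEBUG environment variable enables it
        self.debug = debug or DEBUG

    def log(self, msg, level="debug"):
        if not self.debug:
            return
        # ...colored print by level: debug/info/warn/error

    def fit(self, X, y):
        raise NotImplementedError("fit() must be implemented")

    def predict(self, X):
        raise NotImplementedError("predict() must be implemented")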

LabelEncoderSimple, Calculate, _Decision_Tree, and RandomForest all inherit from it, so every part of the pipeline can be debugged the same way: RandomForest(debug=True).

Part 3 — The Code, Function by Function (with "why" and examples)

3.1 `LabelEncoderSimple` — why encode labels at all?

The impurity and voting helpers count classes with np.bincount, which only accepts non-negative integers, so a label like "cat" cannot be used directly. The encoder maps each unique class to an integer and remembers the mapping so predictions can be decoded back:


class LabelEncoderSimple(BaseModel):
    def fit(self, y) -> None:
        unique_classes = sorted(set(y))
        self.class_to_index = {c: index for index, c in enumerate(unique_classes)}
        self.index_to_class = {i: c for c, i in self.class_to_index.items()}

    def transform(self, y):
        return np.array([self.class_to_index.get(c, -1) for c in y])

    def inverse_transform(self, y_encoded):
        return np.array([self.index_to_class.get(i, None) for i in y_encoded])

Example: y = ["cat","dog","cat"] → class_to_index = {"cat":0, "dog":1} → encoded [0, 1, 0]. After prediction, inverse_transform([0, 1, 0]) gives back ["cat","dog","cat"]. This round-trip is why the class stores both dictionaries instead of just one.

The forest wrapper doesn't force you to encode labels manually — it detects the label dtype itself:


if y.dtype.kind in ("U", "S", "O"):   # string / bytes / object
    self._needs_encoding = True
    self.encoder.fit(y)
    y = self.encoder.transform(y)
else:
    y = y.astype(int)

and decodes predictions back automatically in predict() if _needs_encoding was set. This means RandomForest transparently accepts y = [0, 1, 1, 0] or y = ["cat", "dog", "dog", "cat"] with no manual preprocessing.
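The dtype check and the encode/decode round trip can be tried in isolation. The sketch below uses plain dictionaries instead of the LabelEncoderSimple class, so the variable names are illustrative only.

```python
import numpy as np

y_str = np.array(["cat", "dog", "dog", "cat"])
y_int = np.array([0, 1, 1, 0])

# dtype.kind is "U" for unicode strings and "i" for signed integers
print(y_str.dtype.kind, y_int.dtype.kind)  # U i

# Same mapping logic as the encoder: sorted unique classes -> consecutive integers
classes = sorted(set(y_str.tolist()))
class_to_index = {c: i for i, c in enumerate(classes)}
index_to_class = {i: c for c, i in class_to_index.items()}

encoded = np.array([class_to_index[c] for c in y_str.tolist()])
decoded = np.array([index_to_class[i] for i in encoded.tolist()])

print(encoded.tolist())  # [0, 1, 1, 0]
print(decoded.tolist())  # ['cat', 'dog', 'dog', 'cat']
```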

3.2 `Utils.candidate_thresholds` — why not test every value?

Why: if a numeric column has 10,000 unique values, testing every one as a threshold means 10,000 impurity calculations per feature, per node. That's wasteful, since most nearby thresholds give nearly identical splits.


@staticmethod
def candidate_thresholds(values, max_thresholds=32):
    unique_values = np.unique(values)
    thresholds = (unique_values[:-1] + unique_values[1:]) / 2.0  # midpoints

    if thresholds.size <= max_thresholds:
        return thresholds

    # evenly sample `max_thresholds` candidates across the range
    indices = np.linspace(0, thresholds.size - 1, max_thresholds, dtype=int)
    return thresholds[np.unique(indices)]

Example: column [10, 12, 15, 40, 42, 100] → midpoints [11, 13.5, 27.5, 41, 71]. With only 5 candidates (≤ 32), all are kept; if there were thousands, only an evenly spaced subset of at most 32 would be tried.
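Both cases can be checked with a standalone copy of the function (written as a plain function here rather than a static method). The 10,000-value column is a made-up input chosen to trigger the capping branch.

```python
import numpy as np

def candidate_thresholds(values, max_thresholds=32):
    unique_values = np.unique(values)
    thresholds = (unique_values[:-1] + unique_values[1:]) / 2.0  # midpoints
    if thresholds.size <= max_thresholds:
        return thresholds
    # Too many candidates: keep an evenly spaced subset
    indices = np.linspace(0, thresholds.size - 1, max_thresholds, dtype=int)
    return thresholds[np.unique(indices)]

small = candidate_thresholds([10, 12, 15, 40, 42, 100])
print(small.tolist())          # [11.0, 13.5, 27.5, 41.0, 71.0]

large = candidate_thresholds(np.arange(10_000))
print(large.size)              # 32
print(large[0], large[-1])     # 0.5 9998.5
```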

3.3 `Calculate` — Entropy, Gini, Gain, and Bootstrap in One Place

This class holds every formula from Part 1, plus bootstrap sampling and the majority-class rule.


def _entropy(self, y):
    counts = np.bincount(y)
    probs = counts / len(y)
    probs = probs[probs > 0]
    return -np.sum(probs * np.log2(probs))

def gini(self, y):
    counts = np.bincount(y)
    probs = counts / len(y)
    probs = probs[probs > 0]
    return 1.0 - np.sum(probs ** 2)

def H_parent(self, y):      return self._entropy(y)
def H_left(self, y_left):   return self._entropy(y_left)
def H_right(self, y_right): return self._entropy(y_right)

def H_after(self, y_left, y_right):
    n = len(y_left) + len(y_right)
    return (
        (len(y_left)  / n) * self._entropy(y_left)
      + (len(y_right) / n) * self._entropy(y_right)
    )

def gini_after(self, y_left, y_right):
    n = len(y_left) + len(y_right)
    return (
        (len(y_left)  / n) * self.gini(y_left)
      + (len(y_right) / n) * self.gini(y_right)
    )

def information_gain(self, y_parent, y_left, y_right):
    return self._entropy(y_parent) - self.H_after(y_left, y_right)

def gini_gain(self, y_parent, y_left, y_right):
    return self.gini(y_parent) - self.gini_after(y_left, y_right)

These are exactly the formulas from Part 1 — _entropy and gini implement H(S) and Gini(S), H_after/gini_after implement the weighted post-split average, and information_gain/gini_gain implement IG(D) = H(parent) - H_after.

Why bootstrap sampling, and why with replacement: Random Forest needs each tree to see a slightly different dataset — otherwise every tree learns the same patterns and voting wouldn't reduce error at all. Sampling with replacement (the same row can be picked more than once) creates that diversity cheaply:


def bootstrap_sample(self, X, y, rng):
    n_samples = X.shape[0]
    idxs = rng.choice(n_samples, size=n_samples, replace=True)
    oob_idxs = np.setdiff1d(np.arange(n_samples), idxs)   # rows NOT picked

    return X[idxs], y[idxs], X[oob_idxs], y[oob_idxs]

Example: with 8 rows, a bootstrap sample might draw indices [0,0,2,3,5,5,6,7]: rows 0 and 5 appear twice, while rows 1 and 4 are missing entirely. Those missing rows are the out-of-bag (OOB) samples this function also returns. Because each tree has never seen its own OOB rows, they are the standard way to validate a Random Forest without a separate held-out test set. (This implementation returns OOB data but doesn't score it yet, which is a natural next step.)
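How many rows end up out-of-bag? A given row is missed by all n draws with probability (1 - 1/n)^n, which approaches 1/e ≈ 0.368 for large n. The sketch below checks this empirically; the sample size and seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10_000

# Same two lines as bootstrap_sample: draw with replacement, then find the rows never picked
idxs = rng.choice(n_samples, size=n_samples, replace=True)
oob_idxs = np.setdiff1d(np.arange(n_samples), idxs)

oob_fraction = oob_idxs.size / n_samples
expected = (1 - 1 / n_samples) ** n_samples   # probability that a given row is never drawn

print(f"observed OOB fraction: {oob_fraction:.3f}")
print(f"theoretical value:     {expected:.3f}")
```

So roughly a third of the data is available as "unseen" validation rows for each tree, which is what makes OOB scoring practical.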

Majority class, used whenever a node stops splitting and becomes a leaf:


def _Majority_class(self, y):
    return np.bincount(y).argmax()

3.4 `_Tree_Node` — why a class and not a dict?

Why: a node is one of exactly two things — a decision (feature + threshold + two children) or a leaf (value). A small class makes both cases representable with the same object, and node.value is not None is enough to tell them apart during traversal:


class _Tree_Node:
    def __init__(self, feature=None, threshold=None, left=None, right=None, value=None):
        self.feature = feature
        self.threshold = threshold
        self.left = left
        self.right = right
        self.value = value

3.5 `_Decision_Tree.fit` and `_grow_tree` — why recursive, and why stop?


def fit(self, X, y):
    self.n_features = self.n_features or X.shape[1]
    self.root = self._grow_tree(X, y, depth=0)

Why recursive: a tree is a split, whose children are themselves smaller trees. Recursion mirrors that structure directly — _grow_tree calls itself on the left subset and the right subset.


def _grow_tree(self, X, y, depth=0):
    n_samples, n_features = X.shape
    n_labels = len(np.unique(y))

    if depth >= self.max_depth or n_labels == 1 or n_samples < self.min_samples_split:
        return _Tree_Node(value=self.calc._Majority_class(y))

    best_feature, best_threshold = self._Find_Best_Split(X, y)

    if best_feature is None:
        return _Tree_Node(value=self.calc._Majority_class(y))

    left_idxs  = X[:, best_feature] <= best_threshold
    right_idxs = ~left_idxs

    left  = self._grow_tree(X[left_idxs],  y[left_idxs],  depth + 1)
    right = self._grow_tree(X[right_idxs], y[right_idxs], depth + 1)

    return _Tree_Node(feature=best_feature, threshold=best_threshold, left=left, right=right)

Why three stopping conditions:

depth >= max_depth — prevents the tree from growing forever (overfitting control).
n_labels == 1 — the node is already pure (matches H(S) = 0 from Part 1); splitting further can't improve it.
n_samples < min_samples_split — too few samples left to trust a further split statistically.

Without these, a tree would keep splitting until every leaf has exactly one sample — memorizing the training data instead of generalizing.

3.6 `_Find_Best_Split` — the split function, with feature bagging built in

This is where Random Forest randomness actually happens — not in the forest wrapper, but inside every single tree's split search:


def _Find_Best_Split(self, X, y):
    best_gain, best_feature, best_threshold = -1.0, None, None

    # random subset of feature indices — the key Random Forest trick
    feat_idxs = self.rng.choice(X.shape[1], self.n_features, replace=False)

    for feat in feat_idxs:
        col = X[:, feat]
        thresholds = self.calc.candidate_thresholds(col, self.max_thresholds)

        for threshold in thresholds:
            left_mask, right_mask = col <= threshold, ~(col <= threshold)
            if left_mask.sum() == 0 or right_mask.sum() == 0:
                continue

            if self.criterion == "entropy":
                gain = self.calc.information_gain(y, y[left_mask], y[right_mask])
            else:
                gain = self.calc.gini_gain(y, y[left_mask], y[right_mask])

            if gain > best_gain:
                best_gain, best_feature, best_threshold = gain, feat, threshold

    return best_feature, best_threshold

Why it randomly samples features first: this is feature bagging — instead of considering every column at every node, only a random subset (typically sqrt(total_features)) is considered. Combined with bootstrapping, this is what makes trees in the forest disagree with each other — and disagreement is exactly what majority voting needs in order to cancel out individual mistakes. Notice this happens at every node, not just once per tree — that's a stronger decorrelation effect than bagging features only at the root.

Why it loops feature → threshold → gain: it's an exhaustive-but-bounded search — try every allowed (feature, threshold) combination from candidate_thresholds, keep whichever gives the highest information_gain/gini_gain. This is "best split": no shortcut, just a bounded brute-force search guided directly by the IG formula from Part 1.

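Worked example, using our running dataset: candidate_thresholds proposes midpoints, so the candidate between ages 3 and 4 is threshold=3.5, which separates the data exactly like age <= 3 in Part 1. At the root, trying feature=age, threshold=3.5 gives left=[0,0,0] (pure), right=[1,1,1,1,1] (pure) → IG = 0.954 - 0 = 0.954, the maximum possible, so this becomes the root split whenever age is among the features drawn for that node.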
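The search can be replayed on the running dataset with a simplified, standalone sketch. It assumes no feature subsampling (both columns are always tried) and uses midpoints between consecutive values as candidates, so the winning threshold shows up as 3.5; that separates the data exactly as age <= 3 does.

```python
import numpy as np

def entropy(y):
    counts = np.bincount(y)
    probs = counts[counts > 0] / len(y)
    return float(-np.sum(probs * np.log2(probs)))

X = np.array([[1, 20], [2, 21], [3, 22], [4, 23], [5, 24], [6, 25], [7, 26], [8, 27]])
y = np.array([0, 0, 0, 1, 1, 1, 1, 1])

best_gain, best_feature, best_threshold = -1.0, None, None
for feat in range(X.shape[1]):                 # simplification: try every feature
    col = X[:, feat]
    values = np.unique(col)
    for threshold in (values[:-1] + values[1:]) / 2.0:   # midpoint candidates
        left = col <= threshold
        h_after = (left.sum() * entropy(y[left]) + (~left).sum() * entropy(y[~left])) / len(y)
        gain = entropy(y) - h_after
        if gain > best_gain:                   # strict ">" keeps the first of tied splits
            best_gain, best_feature, best_threshold = gain, feat, float(threshold)

print(best_feature, best_threshold, round(best_gain, 3))  # 0 3.5 0.954
```

Salary at 22.5 yields the same gain on this data; the strict comparison simply keeps the first winner found.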

3.7 `_traverse` and `predict` — prediction traversal


def _traverse(self, x, node):
    if node.value is not None:
        return node.value
    if x[node.feature] <= node.threshold:
        return self._traverse(x, node.left)
    return self._traverse(x, node.right)

def predict(self, X):
    return np.array([self._traverse(x, self.root) for x in X])

Why separate from training: growing a tree happens once; predicting happens for every new sample, potentially thousands of times. Keeping traversal as its own lightweight recursive function means prediction is just "follow the arrows" from root to leaf.

Example: sample x = [5, 24] on the tree trained above, with root split age <= 3.5 → 5 > 3.5 → go right → reaches a leaf with value = 1.
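The traversal can be tried on a hand-built tree without training anything. In this sketch, `Node` is a stand-in for _Tree_Node and `traverse` is a plain-function version of _traverse; the root uses the midpoint 3.5 on age (feature 0), which routes samples the same way as age <= 3 on this data.

```python
class Node:
    """Stand-in for _Tree_Node: either a decision (feature/threshold/children) or a leaf (value)."""
    def __init__(self, feature=None, threshold=None, left=None, right=None, value=None):
        self.feature = feature
        self.threshold = threshold
        self.left = left
        self.right = right
        self.value = value

def traverse(x, node):
    if node.value is not None:          # leaf reached
        return node.value
    if x[node.feature] <= node.threshold:
        return traverse(x, node.left)
    return traverse(x, node.right)

# One decision node on age with two pure leaves
root = Node(feature=0, threshold=3.5, left=Node(value=0), right=Node(value=1))

predictions = [traverse(x, root) for x in ([5, 24], [2, 21])]
print(predictions)  # [1, 0]
```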

3.8 `RandomForest` — the ensemble wrapper


class RandomForest(BaseModel):
    def __init__(self, n_trees=10, max_depth=100, min_samples_split=2,
                 n_features=None, random_state=None, criterion="entropy", debug=False):
        self.n_trees = n_trees
        # Stored so that fit() can pass them on to every tree
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.n_features = n_features
        self.rng = np.random.default_rng(random_state)
        self.trees = []
        self.criterion = criterion
        self.encoder = LabelEncoderSimple(debug=debug)
        self._needs_encoding = False

        if criterion not in ("entropy", "gini"):
            raise ValueError(f"Invalid criterion '{criterion}'. Expected 'entropy' or 'gini'.")

Why a wrapper class at all, instead of training trees in a loop wherever needed: it centralizes everything that only makes sense at the forest level — shared randomness (rng), automatic label encoding, and the final voting step — so every individual tree can stay a simple, self-contained decision tree.

Training — bootstrap + bagging, one tree at a time:


def fit(self, X, y):
    # ...auto label-encoding (Section 3.1) happens here...
    self.trees = []
    for i in range(self.n_trees):
        # Assumed to delegate to Calculate.bootstrap_sample(X, y, self.rng) from Section 3.3
        X_sample, y_sample, _, _ = self._bootstrap_sample(X, y)

        tree = _Decision_Tree(
            min_samples_split=self.min_samples_split,
            max_depth=self.max_depth,
            n_features=self.n_features or max(1, int(np.sqrt(X.shape[1]))),
            criterion=self.criterion,
            rng=self.rng,
        )
        tree.fit(X_sample, y_sample)
        self.trees.append(tree)

Two defaults worth calling out:

If n_features isn't specified, it defaults to sqrt(total_features) — the standard Random Forest heuristic for classification.
All trees share one rng (np.random.default_rng(random_state)), passed down from the forest, so the whole forest is reproducible from a single random_state — not just the bootstrap step, but every feature-bagging draw inside every tree too.

Predicting — majority vote across trees:


def predict(self, X):
    tree_preds = np.array([tree.predict(X) for tree in self.trees])   # (n_trees, n_samples)

    final_preds = np.array([
        self._majority_vote(tree_preds[:, i]) for i in range(X.shape[0])
    ])

    if self._needs_encoding:
        return self.encoder.inverse_transform(final_preds)
    return final_preds

def _majority_vote(self, predictions):
    return np.bincount(predictions).argmax()

Why majority voting reduces error: if individual trees are only somewhat accurate but make different mistakes (thanks to bootstrapping + feature bagging), their errors don't line up. Voting cancels out the noise while keeping the shared signal — the statistical reason a Random Forest usually outperforms any single tree inside it.

Example: 5 trees predict [1, 1, 0, 1, 1] for the same sample → 4 votes for class 1, 1 vote for class 0 → final prediction is 1.
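The error-cancelling effect can be quantified under an idealized assumption: every tree is right with the same probability and the trees err independently. Real trees are correlated, so the actual gain is smaller, but the trend is the same. The function name and the 0.7 accuracy below are illustrative choices.

```python
from math import comb

def majority_correct_probability(n_trees, p_correct):
    """P(majority vote is right) for an odd number of independent binary voters."""
    k_min = n_trees // 2 + 1   # smallest number of correct votes that wins
    return sum(
        comb(n_trees, k) * p_correct ** k * (1 - p_correct) ** (n_trees - k)
        for k in range(k_min, n_trees + 1)
    )

for n in (1, 5, 25, 101):
    print(n, round(majority_correct_probability(n, 0.7), 4))
```

With five such trees the vote is already right about 83.7% of the time, compared with 70% for a single tree, and the probability keeps climbing as trees are added.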

3.9 Putting It All Together


rf = RandomForest(n_trees=100, max_depth=5, criterion="gini", random_state=42, debug=True)
rf.fit(X_train, y_train)          # works with int OR string labels
preds = rf.predict(X_test)

Full flow:


data -> bootstrap_sample() (with OOB tracking)
     -> _Decision_Tree.fit() -> _grow_tree() (random feature subset + best split per node)
     -> _Decision_Tree.predict() -> _traverse()
     -> RandomForest._majority_vote()
     -> predictions (auto-decoded if labels were strings)

Part 4 — A Reusable Project Structure (Any Project, scikit-learn Included)

Whether you use this custom RandomForest or sklearn.ensemble.RandomForestClassifier, the surrounding workflow is always the same shape:


load data -> clean data -> split data -> scaling/encoding -> train model
          -> predict -> evaluate -> tune hyperparameters -> save model


import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report
import joblib

# 1. Load data
df = pd.read_csv("dataset.csv")

# 2. Clean data — why: models can't handle missing values or duplicate rows reliably
df = df.dropna()
df = df.drop_duplicates()

# 3. Encode target — why: same reason as LabelEncoderSimple above, models need numbers
label_encoder = LabelEncoderSimple()
label_encoder.fit(df["target"])
df["target"] = label_encoder.transform(df["target"])

X = df.drop(columns=["target"]).values
y = df["target"].values

# 4. Split data — why: measure performance on data the model has never seen
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 5. Scale features — why: distance/gradient-based models need comparable feature ranges
#    (tree-based models like ours don't strictly require this, but it's part of the
#     general-purpose structure so the same pipeline works for other model types too)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# 6. Train model
model = RandomForest(n_trees=100, max_depth=8, criterion="gini", random_state=42)
model.fit(X_train, y_train)

# 7. Predict
y_pred = model.predict(X_test)

# 8. Evaluate — why: a number to compare against other models/hyperparameters
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

# 9. Tune hyperparameters — why: n_trees/max_depth/criterion all trade off bias vs. variance
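#    Candidates are scored on a validation split carved from the training data,
#    so the test set stays untouched for the final estimate.
X_fit, X_val, y_fit, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=42, stratify=y_train
)
best_score, best_params = -1.0, None
for n_trees in [10, 50, 100]:
    for depth in [5, 8, 12]:
        candidate = RandomForest(n_trees=n_trees, max_depth=depth, random_state=42)
        candidate.fit(X_fit, y_fit)
        score = accuracy_score(y_val, candidate.predict(X_val))
        if score > best_score:
            best_score, best_params = score, (n_trees, depth)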

# 10. Save model — why: retrain is expensive; persist the trained object for reuse
joblib.dump(model, "random_forest_model.pkl")

This structure is what makes the custom RandomForest "drop-in": it exposes fit(X, y) and predict(X) just like scikit-learn's estimators, so it slots into the exact same 10-step pipeline without changing anything else.

Conclusion

Every piece of this implementation earns its place:

BaseModel gives every component a shared fit/predict contract and a debug-logging system.
Entropy and Gini quantify impurity: two roads to (almost) the same destination, one information-theoretic, the other cheaper to compute, exposed as a criterion="entropy"|"gini" choice.
Comparing before vs. after a split (H_parent vs. H_after) is the actual mechanism behind every decision a tree makes; Information Gain turns that comparison into a single number the tree can maximize.
LabelEncoderSimple plus automatic dtype detection means the forest accepts raw string labels with zero manual preprocessing.
candidate_thresholds keeps that maximization affordable on real-sized data by capping and evenly sampling threshold candidates instead of scanning every unique value.
Bootstrap sampling (with OOB tracking) and feature bagging at every node are what make the trees disagree with each other in useful ways.
Recursive growing with stopping conditions keeps trees from memorizing noise.
Majority voting turns many imperfect, weakly correlated trees into one Random Forest that is usually more accurate than any of them alone.
The load → clean → split → scale/encode → train → predict → evaluate → tune → save structure is the same regardless of which model sits inside it — from-scratch or scikit-learn.

The result is a Random Forest that behaves like scikit-learn's from the outside (fit, predict, string or numeric labels, criterion="gini"|"entropy") while being fully transparent — formula by formula — on the inside.

Full implementation available on GitHub: Random_Forest_From_Scratch
