Overview

KernelNet is a neural‑network framework built from the ground up in C++ and CUDA. It provides:

To obtain the KernelNet packages, see the KernelNet releases page, which contains the source code, the pre-built libraries kernelnet.dll and kernelnet.lib, and the /include folder.

Note: For brevity and clarity, all examples in this documentation use using namespace std;. This allows us to write vector, string, pair, etc., without the std:: prefix.

kernelnet::tensor::Tensor

The kernelnet::tensor::Tensor class represents a multi-dimensional array that supports operations on both CPU and CUDA (GPU) devices. It supports numerical arithmetic computations—including element-wise addition, subtraction, multiplication, matrix multiplication and transpose, as well as scalar multiplication, broadcast addition, summation, and argmax—providing a solid foundation for building neural networks.

Constructor

The kernelnet::tensor::Tensor class provides two constructors:

Methods

Utility Methods

Arithmetic and Matrix Operations

Example

The following code snippet demonstrates a basic example using CUDA. Two tensors are created, filled with constant values, transferred to CUDA, multiplied element-wise, and then the result is transferred back to CPU for display.


  #include "kernelnet.hpp"

  // Create two tensors on CPU, fill them, and then move them to CUDA.
  Tensor a(10, CPU);
  Tensor b(10, CPU);
  a.fill(2.0f);
  b.fill(3.0f);
  a.toCUDA();
  b.toCUDA();
  
  // Compute element-wise multiplication on CUDA.
  Tensor result = Tensor::multiply(a, b);
  
  result.toCPU();
  result.print();
        

kernelnet::autograd

The kernelnet::autograd module forms the backbone of KernelNet’s automatic differentiation engine. It enables gradient-based optimization by linking together differentiable operations. The two central classes are kernelnet::autograd::Variable and kernelnet::autograd::Function. While kernelnet::autograd::Variable encapsulates tensor data and manages gradient information, kernelnet::autograd::Function serves as an abstract base for all differentiable operations.

Autograd Classes

A dynamic computational graph is built during the forward pass. Every time a differentiable operation (a function derived from kernelnet::autograd::Function) is executed via its static kernelnet::autograd::Function::apply method, a new kernelnet::autograd::Variable node is created. These nodes encapsulate both the computed tensor value and a pointer (via the creator field) to the operation that produced them. Such intermediate nodes, created for operations like addition, multiplication, and slicing, are only needed during backpropagation to compute gradients. We use the type aliases using VarPtr = shared_ptr<Variable> and using FuncPtr = shared_ptr<Function> so that these temporary nodes are automatically deallocated once they are no longer referenced—typically after the backward pass has propagated gradients through the graph.
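The graph-building idea described above can be sketched on scalars. This is only an illustration under stated assumptions: Var, Fn, VarP and MulFn are hypothetical stand-ins, not KernelNet's Variable and Function classes, and for brevity the sketch propagates every gradient contribution immediately instead of waiting until all contributions have arrived.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical scalar stand-ins for Variable and Function (not the library's classes).
struct Fn;
struct Var {
    float data = 0.0f;
    float grad = 0.0f;
    std::shared_ptr<Fn> creator;  // operation that produced this node (null for leaves)
    explicit Var(float d) : data(d) {}
    void backward(float grad_output);
};
using VarP = std::shared_ptr<Var>;

struct Fn {
    std::vector<VarP> inputs;  // saved during the forward pass
    virtual ~Fn() = default;
    virtual std::vector<float> backward(float grad_output) = 0;
};

void Var::backward(float grad_output) {
    grad += grad_output;   // accumulate this contribution
    if (!creator) return;  // leaf node: nothing further to propagate
    const std::vector<float> grads = creator->backward(grad_output);
    for (std::size_t i = 0; i < grads.size(); ++i)
        creator->inputs[i]->backward(grads[i]);
}

// Example operation: out = a * b.
struct MulFn : Fn {
    static VarP apply(const VarP &a, const VarP &b) {
        assert(a && b);
        auto fn = std::make_shared<MulFn>();
        fn->inputs = {a, b};
        auto out = std::make_shared<Var>(a->data * b->data);
        out->creator = fn;  // link the new node to the operation that produced it
        return out;
    }
    std::vector<float> backward(float grad_output) override {
        // d(a*b)/da = b and d(a*b)/db = a
        return {grad_output * inputs[1]->data, grad_output * inputs[0]->data};
    }
};
```

Because the output node owns its creator and the creator owns its inputs through shared_ptr, dropping the last reference to the output releases every intermediate node that nothing else refers to.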

  • kernelnet::autograd::Variable
    • Constructor
      Variable::Variable(const Tensor &data, bool requires_grad = false, const string &name = "");

      Creates a new variable that wraps a given tensor. If requires_grad is set to true, a gradient tensor (of matching size and device) is created and initialized to zero.

    • set_creator(const FuncPtr &func)
      void Variable::set_creator(const FuncPtr &func);

      Assigns the creator function (i.e., the operation that produced this variable).

    • backward(const Tensor &grad_output)
      void Variable::backward(const Tensor &grad_output);

      Initiates the backpropagation process. It accumulates gradients, and once all contributions are received, it propagates the gradient further by calling the backward method of its creator.

    • detach()
      VarPtr Variable::detach();

      Returns a detached copy of the variable that does not track gradients.

  • kernelnet::autograd::Function
    • Abstract Base Class

      Serves as the base for all differentiable operations.

    • backward(const Tensor &grad_output)
      virtual vector<Tensor> Function::backward(const Tensor &grad_output) = 0;

      A pure virtual method that must be implemented by derived classes. It receives the upstream gradient and computes gradients for each input.

  • Child Functions

    The autograd module provides several built-in differentiable function implementations that derive from the base kernelnet::autograd::Function class. Each derived function implements a static apply method that computes its output, returned as a kernelnet::autograd::Variable, while saving its inputs and parameters for the backward pass. The corresponding backward methods receive the gradient from the next layer and compute the gradient for each input.

    Example

    The following snippet demonstrates using the Mean Squared Error (MSE) loss with KernelNet’s autograd system. Two tensors are created on the CPU (one for prediction and one for target), and the prediction is wrapped in a Variable that requires gradients. The MSE loss is computed, and an upstream gradient of 1 is used to backpropagate the gradients.

    
    #include "kernelnet.hpp"
    
    // Create prediction and target tensors (size 5) on CPU.
    Tensor msePred(5, CPU);
    Tensor mseTarget(5, CPU);
    msePred.fill(2.0f);     
    mseTarget.fill(3.0f);   
    
    // Wrap the prediction in a Variable that tracks gradients.
    auto varPred = make_shared<Variable>(msePred, true);
    
    // Compute MSE loss: loss = mean((prediction - target)^2).
    auto mseLoss = MSEFunction::apply(varPred, mseTarget);
    
    // Create an upstream gradient (of ones) for the backward pass.
    Tensor gradOutput(mseLoss->data.size(), CPU);
    gradOutput.fill(1.0f);
    
    // Perform backpropagation.
    mseLoss->backward(gradOutput);
    
    varPred->grad.print();
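For reference, the arithmetic behind this example can be sketched in plain C++ on std::vector. It assumes the usual definition loss = mean((prediction - target)^2); mse_loss and mse_grad are hypothetical helper names, and MSEFunction may scale its result differently.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// loss = mean((pred - target)^2)
float mse_loss(const std::vector<float> &pred, const std::vector<float> &target) {
    assert(!pred.empty() && pred.size() == target.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < pred.size(); ++i) {
        const float diff = pred[i] - target[i];
        sum += diff * diff;
    }
    return sum / static_cast<float>(pred.size());
}

// d(loss)/d(pred_i) = upstream * 2 * (pred_i - target_i) / n
std::vector<float> mse_grad(const std::vector<float> &pred,
                            const std::vector<float> &target,
                            float upstream = 1.0f) {
    assert(!pred.empty() && pred.size() == target.size());
    const float scale = upstream * 2.0f / static_cast<float>(pred.size());
    std::vector<float> grad(pred.size());
    for (std::size_t i = 0; i < pred.size(); ++i)
        grad[i] = scale * (pred[i] - target[i]);
    return grad;
}
```

With five predictions of 2 and five targets of 3, this definition gives a loss of 1 and a gradient of -0.4 for every element.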
        

    kernelnet::nn::Module

    The kernelnet::nn::Module class defines the abstract base interface for all neural network layers or components in KernelNet. It standardizes how modules perform their forward computation, how their parameters are accessed for optimization, and how gradients are reset before each optimizer step.

    Every custom layer (such as Dense, Conv2D, or LSTM layers) should inherit from kernelnet::nn::Module and override the kernelnet::nn::Module::forward method. Additionally, modules can override the parameters method to expose their trainable parameters.

    Methods

    kernelnet::nn::SingleInputModule

    The kernelnet::nn::SingleInputModule class inherits from kernelnet::nn::Module and serves as an abstract base for neural network layers or components that take exactly one input and produce one output.

    Methods

    kernelnet::nn::Sequential

    The kernelnet::nn::Sequential container module stacks layers linearly. It inherits from kernelnet::nn::SingleInputModule, and each submodule in a Sequential container accepts a single VarPtr as input and returns a single VarPtr as output.

    Constructor

    The kernelnet::nn::Sequential container provides two constructors:

    Methods

    Example

    Two layers are added to a Sequential container, and a forward pass is executed.

    
      #include "kernelnet.hpp"
    
      auto layer1 = make_shared<Dense>(input_dim, hidden_dim, device);
      auto layer2 = make_shared<Dense>(hidden_dim, output_dim, device);
      Sequential model = { layer1, layer2 };
      
      VarPtr output = model.forward(input);
            

    kernelnet::nn::Sigmoid

    The kernelnet::nn::Sigmoid module implements the sigmoid activation function, defined as:

    sigmoid(x) = 1 / (1 + exp(-x))

    It is implemented using an autograd-compatible function, kernelnet::nn::SigmoidFunction, which inherits from kernelnet::autograd::Function and computes both the forward pass and the gradients during backpropagation. The user-facing kernelnet::nn::Sigmoid module wraps this functionality so that it can be used as a layer in custom models.

    Sigmoid Function

    The kernelnet::nn::SigmoidFunction class, inherited from kernelnet::autograd::Function, provides the autograd-compatible implementation of the sigmoid activation.

    In the forward pass, the input tensor is transformed into its sigmoid representation. During the backward pass, it computes the gradient based on the derivative: y * (1 - y), where y is the sigmoid output.
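The forward and backward rules above can be sketched element-wise on plain vectors. The helper names sigmoid_forward and sigmoid_backward are hypothetical; the sketch shows only the math, not SigmoidFunction's actual CPU or CUDA code.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// y_i = 1 / (1 + exp(-x_i))
std::vector<float> sigmoid_forward(const std::vector<float> &x) {
    std::vector<float> y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i)
        y[i] = 1.0f / (1.0f + std::exp(-x[i]));
    return y;
}

// grad_input_i = grad_output_i * y_i * (1 - y_i), using the cached output y
std::vector<float> sigmoid_backward(const std::vector<float> &y,
                                    const std::vector<float> &grad_output) {
    assert(y.size() == grad_output.size());
    std::vector<float> grad_input(y.size());
    for (std::size_t i = 0; i < y.size(); ++i)
        grad_input[i] = grad_output[i] * y[i] * (1.0f - y[i]);
    return grad_input;
}
```

Caching the output y means the backward pass needs no further exponentials.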

    Methods:

    Sigmoid Module

    The kernelnet::nn::Sigmoid class is a user-facing module that inherits from kernelnet::nn::SingleInputModule. It acts as a wrapper for the kernelnet::nn::SigmoidFunction.

    In its forward pass, the kernelnet::nn::Sigmoid module calls kernelnet::nn::SigmoidFunction::apply on the input variable.

    
      Sigmoid::Sigmoid() {} // Constructor
      
      VarPtr Sigmoid::forward(const VarPtr &input) {
          return SigmoidFunction::apply(input);
      }
            

    kernelnet::nn::Softmax

    The kernelnet::nn::Softmax module implements the softmax activation function, which normalizes raw input scores into a probability distribution. It is defined as:

    softmax(x_i) = exp(x_i) / sum_j exp(x_j)

    The module is built upon an autograd-compatible function, kernelnet::nn::SoftmaxFunction, which handles both the forward computation (applying softmax) and the backward computation (calculating gradients for backpropagation). The user-facing kernelnet::nn::Softmax class wraps this functionality so that it can be used as a layer in custom models.

    Softmax Function

    The kernelnet::nn::SoftmaxFunction class, inherited from kernelnet::autograd::Function, provides the autograd-compatible implementation of the softmax activation. It computes the forward pass by normalizing input scores along the last dimension of each sample, and caches the output for use in the backward pass.

    During the backward pass, it calculates the gradient with respect to the input using the formula:

    dL/dx_i = y_i * (dL/dy_i - sum_j (dL/dy_j * y_j)), where y = softmax(x).
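A row-wise sketch of the softmax forward and backward computations on a flat buffer is given below. It assumes a row-major [batch_size x num_classes] layout and subtracts each row's maximum before exponentiating for numerical stability; both are assumptions, as are the helper names softmax_forward and softmax_backward.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

std::vector<float> softmax_forward(const std::vector<float> &x,
                                   std::size_t batch_size, std::size_t num_classes) {
    assert(num_classes > 0 && x.size() == batch_size * num_classes);
    std::vector<float> y(x.size());
    for (std::size_t b = 0; b < batch_size; ++b) {
        const std::size_t off = b * num_classes;
        // Subtracting the row maximum leaves the result unchanged but avoids overflow.
        const float max_v = *std::max_element(x.begin() + off, x.begin() + off + num_classes);
        float sum = 0.0f;
        for (std::size_t c = 0; c < num_classes; ++c) {
            y[off + c] = std::exp(x[off + c] - max_v);
            sum += y[off + c];
        }
        for (std::size_t c = 0; c < num_classes; ++c)
            y[off + c] /= sum;
    }
    return y;
}

// dL/dx_i = y_i * (dL/dy_i - sum_j (dL/dy_j * y_j)), applied to each row
std::vector<float> softmax_backward(const std::vector<float> &y,
                                    const std::vector<float> &grad_output,
                                    std::size_t batch_size, std::size_t num_classes) {
    assert(y.size() == batch_size * num_classes && grad_output.size() == y.size());
    std::vector<float> grad_input(y.size());
    for (std::size_t b = 0; b < batch_size; ++b) {
        const std::size_t off = b * num_classes;
        float dot = 0.0f;
        for (std::size_t c = 0; c < num_classes; ++c)
            dot += grad_output[off + c] * y[off + c];
        for (std::size_t c = 0; c < num_classes; ++c)
            grad_input[off + c] = y[off + c] * (grad_output[off + c] - dot);
    }
    return grad_input;
}
```

For a single row of two equal scores, the forward pass yields 0.5 for each class, and an upstream gradient of (1, 0) maps to (0.25, -0.25).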

    Methods:

    Softmax Module

    The kernelnet::nn::Softmax class is a user-facing module that inherits from kernelnet::nn::SingleInputModule. It acts as a wrapper for kernelnet::nn::SoftmaxFunction.

    In its forward pass, the kernelnet::nn::Softmax module invokes SoftmaxFunction::apply, passing the input variable along with the defined batch_size and num_classes, to compute the softmax activation.

    
        Softmax::Softmax(int batch_size, int num_classes)
            : batch_size(batch_size), num_classes(num_classes) {}
    
        VarPtr Softmax::forward(const VarPtr &input) {
            return SoftmaxFunction::apply(input, batch_size, num_classes);
        }
        

    kernelnet::nn::Tanh

    The kernelnet::nn::Tanh module implements the hyperbolic tangent (tanh) activation function, defined as:

    tanh(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x))

    This module integrates with the autograd system by leveraging the kernelnet::nn::TanhFunction class, which handles both the forward computation and the gradient calculation needed during backpropagation.

    Tanh Function

    The kernelnet::nn::TanhFunction class provides the autograd-compatible implementation of the tanh activation.

    In the forward pass, the input tensor is transformed element-wise using: tanh(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x)). The output is stored for later use in the backward pass.

    During the backward pass, the gradient is computed with the derivative: dL/dx = dL/dy * (1 - tanh(x)^2).
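As a rough sketch of these two passes on plain vectors (tanh_forward and tanh_backward are hypothetical helper names, and std::tanh stands in for the explicit exponential formula):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// y_i = tanh(x_i)
std::vector<float> tanh_forward(const std::vector<float> &x) {
    std::vector<float> y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i)
        y[i] = std::tanh(x[i]);
    return y;
}

// dL/dx_i = dL/dy_i * (1 - y_i^2), using the output y stored in the forward pass
std::vector<float> tanh_backward(const std::vector<float> &y,
                                 const std::vector<float> &grad_output) {
    assert(y.size() == grad_output.size());
    std::vector<float> grad_input(y.size());
    for (std::size_t i = 0; i < y.size(); ++i)
        grad_input[i] = grad_output[i] * (1.0f - y[i] * y[i]);
    return grad_input;
}
```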

    Methods:

    Tanh Module

    The kernelnet::nn::Tanh class serves as the user-facing module that encapsulates the functionality of TanhFunction, inheriting from kernelnet::nn::SingleInputModule.

    In its forward pass, the Tanh module calls TanhFunction::apply on the input variable to perform the tanh activation.

    
    Tanh::Tanh() {} // Constructor
    
    VarPtr Tanh::forward(const VarPtr &input) {
        return TanhFunction::apply(input);
    }
        

    kernelnet::nn::ReLU

    The kernelnet::nn::ReLU module implements the Rectified Linear Unit (ReLU) activation function, defined as:

    ReLU(x) = max(0, x)

    This module integrates with the autograd system by utilizing the kernelnet::nn::ReLUFunction class, which manages both the forward computation and the backward gradient propagation during backpropagation.

    ReLU Function

    The kernelnet::nn::ReLUFunction class, inherited from kernelnet::autograd::Function, provides an autograd-compatible implementation of the ReLU activation. It computes the forward pass by applying: ReLU(x) = (x > 0) ? x : 0 element-wise on the input and caches the output for use in the backward pass.

    For the backward pass, the gradient is computed only for the positive elements of the input using: grad_in = grad_output * (x > 0 ? 1 : 0).
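The masking behaviour can be sketched on plain vectors as follows; relu_forward and relu_backward are hypothetical helper names, and the sketch treats x = 0 as inactive, matching the (x > 0) test above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// y_i = (x_i > 0) ? x_i : 0
std::vector<float> relu_forward(const std::vector<float> &x) {
    std::vector<float> y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i)
        y[i] = x[i] > 0.0f ? x[i] : 0.0f;
    return y;
}

// grad_in_i = grad_output_i * (x_i > 0 ? 1 : 0)
std::vector<float> relu_backward(const std::vector<float> &x,
                                 const std::vector<float> &grad_output) {
    assert(x.size() == grad_output.size());
    std::vector<float> grad_in(x.size());
    for (std::size_t i = 0; i < x.size(); ++i)
        grad_in[i] = x[i] > 0.0f ? grad_output[i] : 0.0f;
    return grad_in;
}
```

Since the output is positive exactly where the input is positive, the same mask can be derived from the cached output instead of the input.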

    Methods:

    ReLU Module

    The kernelnet::nn::ReLU class is a user-facing module that inherits from kernelnet::nn::SingleInputModule and serves as a wrapper around ReLUFunction.

    In its forward pass, the module calls ReLUFunction::apply on the input variable to compute the ReLU activation, with support for both CPU and CUDA computations.

    
    ReLU::ReLU() {} // Constructor
    
    VarPtr ReLU::forward(const VarPtr &input) {
        return ReLUFunction::apply(input);
    }
            

    kernelnet::nn::Dense

    The kernelnet::nn::Dense layer is a fully connected (linear) neural network layer. It performs a linear transformation on input data by computing:

    output = input × weight^T + bias

    Here, the weight matrix has dimensions (output_dim × input_dim) and the bias vector has dimensions (output_dim). The Dense layer automatically manages the gradient computations during backpropagation.
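To make the shapes concrete, here is a minimal CPU sketch of this linear transformation on flat, row-major buffers. The layout and the helper name dense_forward are assumptions for illustration; the library's own kernels may be organised differently.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// input:  [batch_size x input_dim], weight: [output_dim x input_dim], bias: [output_dim]
// output: [batch_size x output_dim], with output = input x weight^T + bias
std::vector<float> dense_forward(const std::vector<float> &input,
                                 const std::vector<float> &weight,
                                 const std::vector<float> &bias,
                                 std::size_t batch_size, std::size_t input_dim,
                                 std::size_t output_dim) {
    assert(input.size() == batch_size * input_dim);
    assert(weight.size() == output_dim * input_dim);
    assert(bias.size() == output_dim);
    std::vector<float> output(batch_size * output_dim);
    for (std::size_t b = 0; b < batch_size; ++b) {
        for (std::size_t o = 0; o < output_dim; ++o) {
            float acc = bias[o];
            // Row o of weight is dotted with sample b, which is what weight^T expresses.
            for (std::size_t i = 0; i < input_dim; ++i)
                acc += input[b * input_dim + i] * weight[o * input_dim + i];
            output[b * output_dim + o] = acc;
        }
    }
    return output;
}
```

For one sample (1, 2), weight rows (1, 2) and (3, 4), and bias (10, 20), the output is (15, 31).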

    Constructor

    The kernelnet::nn::Dense layer, inherited from kernelnet::nn::SingleInputModule, is constructed by providing:

    Dense::Dense(int input_dim, int output_dim, Device device)
          : input_dim(input_dim), output_dim(output_dim) {}

    Methods

    Example

    
      #include "kernelnet.hpp"
    
      // Create a Dense layer with 128 input features and 64 output features on CUDA.
      Dense denseLayer(128, 64, CUDA);
      
      // Create an input tensor of batch_size * input_dim values.
      int batch_size = 32;  // example batch size
      Tensor inputTensor(128 * batch_size, CPU);
      inputTensor.fill(1.0f);
      inputTensor.toCUDA();
      
      // Wrap the input tensor in a Variable.
      auto inputVar = make_shared<Variable>(inputTensor, true);
      
      VarPtr outputVar = denseLayer.forward(inputVar);
      
      outputVar->print();
            

    kernelnet::nn::Embedding

    The kernelnet::nn::Embedding layer converts token indices into dense embedding vectors using a learnable weight matrix. Internally, the embedding lookup is performed by the Embedding::EmbeddingLookupFunction inherited from kernelnet::autograd::Function, an autograd-compatible function that caches token indices during the forward pass and accumulates gradients for the corresponding rows of the weight during the backward pass.
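The lookup and its gradient accumulation can be sketched as follows. This is an illustration only: it uses integer indices and a row-major [vocab_size x embed_dim] weight buffer, and the helper names embedding_forward and embedding_backward are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Forward: copy row indices[t] of the weight matrix into output row t.
std::vector<float> embedding_forward(const std::vector<int> &indices,
                                     const std::vector<float> &weight,
                                     std::size_t embed_dim) {
    assert(embed_dim > 0 && weight.size() % embed_dim == 0);
    const std::size_t vocab_size = weight.size() / embed_dim;
    std::vector<float> out(indices.size() * embed_dim);
    for (std::size_t t = 0; t < indices.size(); ++t) {
        assert(indices[t] >= 0 && static_cast<std::size_t>(indices[t]) < vocab_size);
        const std::size_t row = static_cast<std::size_t>(indices[t]);
        for (std::size_t d = 0; d < embed_dim; ++d)
            out[t * embed_dim + d] = weight[row * embed_dim + d];
    }
    return out;
}

// Backward: add each output-row gradient to the weight row it was read from.
// A token that occurs several times contributes several times to the same row.
std::vector<float> embedding_backward(const std::vector<int> &indices,
                                      const std::vector<float> &grad_output,
                                      std::size_t vocab_size, std::size_t embed_dim) {
    assert(grad_output.size() == indices.size() * embed_dim);
    std::vector<float> grad_weight(vocab_size * embed_dim, 0.0f);
    for (std::size_t t = 0; t < indices.size(); ++t) {
        assert(indices[t] >= 0 && static_cast<std::size_t>(indices[t]) < vocab_size);
        const std::size_t row = static_cast<std::size_t>(indices[t]);
        for (std::size_t d = 0; d < embed_dim; ++d)
            grad_weight[row * embed_dim + d] += grad_output[t * embed_dim + d];
    }
    return grad_weight;
}
```

Rows of the weight matrix whose tokens do not appear in the batch receive a zero gradient.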

    Constructor

    The kernelnet::nn::Embedding layer, inherited from kernelnet::nn::SingleInputModule, is constructed by specifying:

    Embedding::Embedding(int vocab_size, int embed_dim, Device dev = CPU)
          : vocab_size(vocab_size), embed_dim(embed_dim) {}

    Methods

    Example

    
      #include "kernelnet.hpp"
    
      // Create an Embedding layer with a vocabulary of 1000 tokens and embeddings of size 128 on CUDA.
      Embedding embeddingLayer(1000, 128, CUDA);
      
      Tensor tokenIndices(10, CPU);
      for (size_t i = 0; i < tokenIndices.size(); i++) {
          tokenIndices.data()[i] = static_cast<float>(i % 1000);
      }
      tokenIndices.toCUDA();
      
      // Wrap the token indices in a Variable.
      auto indicesVar = make_shared<Variable>(tokenIndices, false);
      
      VarPtr embeddings = embeddingLayer.forward(indicesVar);
      
      embeddings->print();
            

    kernelnet::nn::Conv2D

    The kernelnet::nn::Conv2D layer is a learnable 2D convolutional module designed for processing image data. It applies a set of convolutional kernels (filters) to an input tensor (e.g., an image or feature map) and adds a bias, computing the operation:

    output = input × weight^T + bias

    Constructor

    The kernelnet::nn::Conv2D layer, inherited from kernelnet::nn::SingleInputModule, is constructed by specifying:

    Conv2D::Conv2D(int in_channels, int out_channels, int kernel_h, int kernel_w,
                 int input_height, int input_width, int stride, int padding, Device device)
          : in_channels(in_channels), out_channels(out_channels),
            kernel_h(kernel_h), kernel_w(kernel_w),
            stride(stride), padding(padding),
            input_height(input_height), input_width(input_width) {}
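The constructor parameters determine the spatial size of the output. The sketch below uses the standard convolution size formula and assumes KernelNet follows it; conv_output_size is a hypothetical helper, not part of the library.

```cpp
#include <cassert>

// Standard formula: out = floor((in + 2 * padding - kernel) / stride) + 1
int conv_output_size(int input, int kernel, int stride, int padding) {
    assert(input > 0 && kernel > 0 && stride > 0 && padding >= 0);
    assert(input + 2 * padding >= kernel);
    return (input + 2 * padding - kernel) / stride + 1;
}
```

Under this formula a 3x3 kernel with stride 1 and padding 1 keeps a 32x32 input at 32x32, while the same kernel without padding gives 30x30.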
            

    Methods

    Example

    
      #include "kernelnet.hpp"
    
      // Create a Conv2D layer with 3 input channels, 16 output channels,
      // a 3x3 kernel, input dimensions 32x32, stride 1, and no padding on CUDA.
      Conv2D convLayer(3, 16, 3, 3, 32, 32, 1, 0, CUDA);
      
      // Create an input tensor for a batch of 10 samples.
      Tensor inputTensor(10 * 3 * 32 * 32, CPU);
      inputTensor.fill(1.0f);
      inputTensor.toCUDA();
      
      // Wrap the input tensor into a Variable.
      auto inputVar = make_shared<Variable>(inputTensor, true);
      
      VarPtr outputVar = convLayer.forward(inputVar);
    
      outputVar->toCPU();
      outputVar->print();
            

    kernelnet::nn::MaxPool2D

    The kernelnet::nn::MaxPool2D layer performs spatial downsampling by applying a max pooling operation. It takes an input tensor with shape [batch_size, channels, input_height, input_width] and outputs a pooled tensor with reduced spatial dimensions. During the forward pass, the maximum value within each pooling window is selected, and its index is stored for use in the backward pass.
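The role of the stored indices is easiest to see in a sketch for a single channel of a single sample. The names PoolResult, maxpool2d_forward and maxpool2d_backward are hypothetical, and the output size (input - kernel) / stride + 1 is the usual unpadded pooling formula, assumed here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct PoolResult {
    std::vector<float> output;         // pooled values, row-major [out_h x out_w]
    std::vector<std::size_t> indices;  // flat input index of each selected maximum
    std::size_t out_h = 0;
    std::size_t out_w = 0;
};

PoolResult maxpool2d_forward(const std::vector<float> &input, std::size_t height,
                             std::size_t width, std::size_t kernel_size, std::size_t stride) {
    assert(input.size() == height * width);
    assert(kernel_size > 0 && stride > 0 && kernel_size <= height && kernel_size <= width);
    PoolResult r;
    r.out_h = (height - kernel_size) / stride + 1;
    r.out_w = (width - kernel_size) / stride + 1;
    for (std::size_t oy = 0; oy < r.out_h; ++oy) {
        for (std::size_t ox = 0; ox < r.out_w; ++ox) {
            std::size_t best = (oy * stride) * width + ox * stride;
            for (std::size_t ky = 0; ky < kernel_size; ++ky) {
                for (std::size_t kx = 0; kx < kernel_size; ++kx) {
                    const std::size_t idx = (oy * stride + ky) * width + (ox * stride + kx);
                    if (input[idx] > input[best]) best = idx;
                }
            }
            r.output.push_back(input[best]);
            r.indices.push_back(best);
        }
    }
    return r;
}

// Backward: each upstream gradient is routed to the position that won the max.
std::vector<float> maxpool2d_backward(const PoolResult &fwd,
                                      const std::vector<float> &grad_output,
                                      std::size_t input_size) {
    assert(grad_output.size() == fwd.indices.size());
    std::vector<float> grad_input(input_size, 0.0f);
    for (std::size_t i = 0; i < grad_output.size(); ++i)
        grad_input[fwd.indices[i]] += grad_output[i];
    return grad_input;
}
```

All positions that were not the maximum of their window receive a zero gradient.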

    Constructor

    The kernelnet::nn::MaxPool2D module, inherited from kernelnet::nn::SingleInputModule, is initialized with the following parameters:

    MaxPool2D::MaxPool2D(int kernel_size, int stride,
          int batch_size, int channels,
          int input_height, int input_width)
          : kernel_size(kernel_size), stride(stride),
            batch_size(batch_size), channels(channels),
            input_height(input_height), input_width(input_width) {}

    Methods

    Example

    
      #include "kernelnet.hpp"
    
      // Create a MaxPool2D layer with a 2x2 pooling window and stride of 2, for a batch of 10 samples and 3 channels, with input size 32x32.
      MaxPool2D maxPoolLayer(2, 2, 10, 3, 32, 32);
      
      // Create an input tensor of shape (10 * 3 * 32 * 32).
      Tensor inputTensor(10 * 3 * 32 * 32, CPU);
      inputTensor.fill(1.0f);
      
      auto inputVar = make_shared<Variable>(inputTensor, true);
      
      VarPtr outputVar = maxPoolLayer.forward(inputVar);
    
      outputVar->print();
            

    kernelnet::nn::LSTMCell

    The kernelnet::nn::LSTMCell module implements a single time-step of an LSTM. It encapsulates the learnable parameters (input-to-hidden and hidden-to-hidden weights and biases) and internally invokes LSTMCellFunction::apply to build the autograd graph. In the forward pass, it computes the four gates (input, forget, cell, output), updates the cell and hidden states, and caches all intermediates required for the backward pass.
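A single-sample sketch of one LSTM step is shown below. It follows the standard LSTM equations; the gate order (input, forget, cell, output), the stacked [4 * hidden_dim] weight layout, the single combined bias, and the names LSTMState and lstm_cell_step are all assumptions made for illustration and may not match LSTMCellFunction.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct LSTMState {
    std::vector<float> h;  // new hidden state
    std::vector<float> c;  // new cell state
};

// w_ih: [4*hidden x input_dim], w_hh: [4*hidden x hidden], bias: [4*hidden] (row-major).
LSTMState lstm_cell_step(const std::vector<float> &x, const std::vector<float> &h_prev,
                         const std::vector<float> &c_prev, const std::vector<float> &w_ih,
                         const std::vector<float> &w_hh, const std::vector<float> &bias) {
    const std::size_t input_dim = x.size();
    const std::size_t hidden = h_prev.size();
    assert(c_prev.size() == hidden && bias.size() == 4 * hidden);
    assert(w_ih.size() == 4 * hidden * input_dim && w_hh.size() == 4 * hidden * hidden);

    // Pre-activations of all four gates: z = W_ih x + W_hh h_prev + b.
    std::vector<float> z(4 * hidden);
    for (std::size_t r = 0; r < 4 * hidden; ++r) {
        float acc = bias[r];
        for (std::size_t j = 0; j < input_dim; ++j) acc += w_ih[r * input_dim + j] * x[j];
        for (std::size_t j = 0; j < hidden; ++j) acc += w_hh[r * hidden + j] * h_prev[j];
        z[r] = acc;
    }

    auto sigmoid = [](float v) { return 1.0f / (1.0f + std::exp(-v)); };
    LSTMState out{std::vector<float>(hidden), std::vector<float>(hidden)};
    for (std::size_t k = 0; k < hidden; ++k) {
        const float i = sigmoid(z[k]);                   // input gate
        const float f = sigmoid(z[hidden + k]);          // forget gate
        const float g = std::tanh(z[2 * hidden + k]);    // candidate cell value
        const float o = sigmoid(z[3 * hidden + k]);      // output gate
        out.c[k] = f * c_prev[k] + i * g;
        out.h[k] = o * std::tanh(out.c[k]);
    }
    return out;
}
```

With all weights and biases set to zero, every sigmoid gate evaluates to 0.5 and the candidate to 0, so a previous cell state of 2 becomes 1 and the hidden state becomes 0.5 * tanh(1).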

    Constructor

    The kernelnet::nn::LSTMCell module, inherited from kernelnet::nn::Module, is initialized with the following parameters:

    
      LSTMCell::LSTMCell(int input_dim, int hidden_dim, Device device = CPU)
          : input_dim(input_dim), hidden_dim(hidden_dim), device(device) {}
            

    Methods

    Example

    
      #include "kernelnet.hpp"
    
      // Create cell for input_dim=16, hidden_dim=32 on CPU
      LSTMCell lstm(16, 32, CPU);
      
      // Prepare dummy inputs
      Tensor x(10 * 16, CPU); x.fill(0.1f);
      Tensor h(10 * 32, CPU); h.fill(0.0f);
      Tensor c(10 * 32, CPU); c.fill(0.0f);
      auto x_var = make_shared<Variable>(x, true);
      auto h_var = make_shared<Variable>(h, true);
      auto c_var = make_shared<Variable>(c, true);
      
      auto [h_new, c_new] = lstm.forward({x_var, h_var, c_var});
            

    kernelnet::nn::LSTM

    The kernelnet::nn::LSTM module wraps a single kernelnet::nn::LSTMCell to process an entire sequence of length sequence_length. Given a flattened input tensor of shape [batch_size * sequence_length * input_dim], it slices out each time‐step, feeds it (along with the running hidden and cell state) into the internal kernelnet::nn::LSTMCell, and finally concatenates all hidden‐state outputs into one long tensor.

    Constructor

    The kernelnet::nn::LSTM module, inherited from kernelnet::nn::SingleInputModule, is initialized with the following parameters:

    
    LSTM::LSTM(int batch_size, int sequence_length, int input_dim, int hidden_dim, Device device)
      : batch_size(batch_size), sequence_length(sequence_length), input_dim(input_dim), hidden_dim(hidden_dim), device(device), cell(input_dim, hidden_dim, device) {}
          

    Internally constructs a single kernelnet::nn::LSTMCell with the same input_dim and hidden_dim.

    Methods

    Example

    
    #include "kernelnet.hpp"
    
    // Unrolling an LSTM over a 20‐step sequence of 8‐dim inputs and 16-dim outputs with batch size 4
    LSTM lstm(4, 20, 8, 16, CPU);
    
    Tensor seq_in(4 * 20 * 8, CPU);
    seq_in.fill(0.5f);
    auto seq_var = make_shared<Variable>(seq_in, true);
    
    auto out = lstm.forward(seq_var);
          

    kernelnet::optim::SGD

    kernelnet::optim::SGD implements stochastic gradient descent with optional per-parameter gradient clipping. On each step(), when clip_value is nonzero, it rescales any parameter gradient whose ℓ₂-norm exceeds clip_value, then updates each parameter:

    param = param - lr * grad

    Constructor

    - params: List of trainable variables
    - lr: Learning rate
    - clip_value: Maximum allowed gradient norm (0 = no clipping)

    SGD(const vector<VarPtr>& params, float lr, float clip_value = 0.0f); 

    Methods

    • step()
      void SGD::step();

      Optionally clips each parameter’s gradient, then updates param[i] -= lr * grad[i].

    • zero_grad()
      void SGD::zero_grad();

      Sets all gradients to zero and resets their initialization flags.
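The update rule with norm clipping can be sketched for a single parameter as follows. sgd_step is a hypothetical free function, and scaling an over-long gradient by clip_value / norm is the common convention, assumed here rather than taken from the library source.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Updates param in place; grad is rescaled in place when clipping applies.
void sgd_step(std::vector<float> &param, std::vector<float> &grad, float lr,
              float clip_value = 0.0f) {
    assert(param.size() == grad.size());
    if (clip_value > 0.0f) {
        float sum_sq = 0.0f;
        for (float g : grad) sum_sq += g * g;
        const float norm = std::sqrt(sum_sq);
        if (norm > clip_value) {
            const float scale = clip_value / norm;  // brings the norm down to clip_value
            for (float &g : grad) g *= scale;
        }
    }
    for (std::size_t i = 0; i < param.size(); ++i)
        param[i] -= lr * grad[i];
}
```

For a gradient (3, 4) with clip_value 1, the norm is 5, so the gradient is scaled to (0.6, 0.8) before the update.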

    kernelnet::trainer::Trainer

    The kernelnet::trainer::Trainer class wraps a kernelnet::nn::Sequential model, a kernelnet::optim::SGD optimizer, and a pluggable loss function inherited from kernelnet::autograd::Function. For each sample, its trainEpoch method runs:

    1. Forward through the model
    2. Compute loss via the provided LossFunction
    3. Backward pass to populate gradients
    4. Optimizer step()
    5. Optimizer zero_grad()

    Constructor

    • model: The customized kernelnet::nn::Sequential model
    • optimizer: A kernelnet::optim::SGD instance
    • loss_fn: Function inherited from kernelnet::autograd::Function taking (prediction, target_tensor) → scalar loss
    Trainer(const shared_ptr<Sequential>& model, const SGD& optimizer, LossFunction loss_fn = MSEFunction::apply);

    Methods

    • trainEpoch(const vector<VarPtr>& inputs, const vector<VarPtr>& targets)
      trainEpoch(const vector<VarPtr>& inputs, const vector<VarPtr>& targets);

      Runs one epoch over the given input–target pairs, performing forward, loss, backward, update, and zero‑grad for each sample.

    kernelnet::data

    The kernelnet::data module provides utilities for loading and batching datasets. Currently supported:

    • CIFAR‑10 (image classification)
    • Penn Treebank (PTB) (language modeling)

    kernelnet::data::CIFAR10Dataset

    Loads CIFAR‑10 binary files, normalizes images, and one‑hot encodes labels.

    
    class CIFAR10Dataset {
    public:
      CIFAR10Dataset(const string &data_dir, bool train);
      size_t size() const;
      const CIFAR10Sample& getSample(size_t index) const;
    };
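For orientation, each record in the CIFAR-10 binary files consists of one label byte followed by 3072 pixel bytes (the 1024 red, 1024 green, and 1024 blue values of a 32x32 image). The sketch below decodes one such record; scaling pixels to [0, 1] by dividing by 255 is an assumed normalization, and Cifar10Record and parse_cifar10_record are hypothetical names rather than the library's CIFAR10Sample.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kNumClasses = 10;
constexpr std::size_t kImageBytes = 3 * 32 * 32;       // R plane, G plane, B plane
constexpr std::size_t kRecordBytes = 1 + kImageBytes;  // label byte + pixel bytes

struct Cifar10Record {
    std::vector<float> image;  // 3072 values scaled to [0, 1]
    std::vector<float> label;  // one-hot vector of length 10
};

Cifar10Record parse_cifar10_record(const std::vector<unsigned char> &record) {
    assert(record.size() == kRecordBytes);
    assert(record[0] < kNumClasses);
    Cifar10Record out;
    out.label.assign(kNumClasses, 0.0f);
    out.label[record[0]] = 1.0f;  // one-hot encode the class index
    out.image.resize(kImageBytes);
    for (std::size_t i = 0; i < kImageBytes; ++i)
        out.image[i] = static_cast<float>(record[1 + i]) / 255.0f;
    return out;
}
```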

    kernelnet::data::CIFAR10DataLoader

    Wraps a kernelnet::data::CIFAR10Dataset for shuffled mini‑batch iteration.

    
    class CIFAR10DataLoader {
    public:
      CIFAR10DataLoader(CIFAR10Dataset &dataset, int batch_size, bool shuffle=true);
      void reset();
      bool hasNext() const;
      pair<Tensor,Tensor> nextBatch();
    };

    kernelnet::data::PTBDataset

    Loads a Penn Treebank text file, builds a word‐to‐index vocabulary, and slices the token stream into fixed‐length (input, target) sequence pairs.

    
    struct PTBSample {
      Tensor input;   // length = sequence_length
      Tensor target;  // length = sequence_length
    };
    
    class PTBDataset {
    public:
      PTBDataset(const string &file, int sequence_length);
      size_t size() const;
      const PTBSample& getSample(size_t index) const;
    private:
      void loadFile(const string &filename);
      void buildVocabulary(const vector<string> &tokens);
    };
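The two preprocessing steps can be sketched as follows. The sketch assumes that vocabulary indices are assigned in order of first appearance and that each target is the input shifted by one token over non-overlapping windows, which is the usual language-modeling setup. PTBDataset may differ in both respects, and the names build_vocabulary, SeqPair and make_sequences are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Assigns each distinct token the next free index, in order of first appearance.
std::unordered_map<std::string, int> build_vocabulary(const std::vector<std::string> &tokens) {
    std::unordered_map<std::string, int> vocab;
    for (const auto &t : tokens)
        vocab.emplace(t, static_cast<int>(vocab.size()));  // no effect if t is already present
    return vocab;
}

struct SeqPair {
    std::vector<int> input;   // tokens [start, start + seq_len)
    std::vector<int> target;  // tokens [start + 1, start + seq_len + 1)
};

// Cuts the token stream into non-overlapping windows; trailing tokens that
// cannot fill a complete (input, target) pair are dropped.
std::vector<SeqPair> make_sequences(const std::vector<int> &ids, std::size_t seq_len) {
    assert(seq_len > 0);
    std::vector<SeqPair> pairs;
    for (std::size_t start = 0; start + seq_len + 1 <= ids.size(); start += seq_len) {
        const auto first = ids.begin() + static_cast<std::ptrdiff_t>(start);
        const auto len = static_cast<std::ptrdiff_t>(seq_len);
        SeqPair p;
        p.input.assign(first, first + len);
        p.target.assign(first + 1, first + len + 1);
        pairs.push_back(std::move(p));
    }
    return pairs;
}
```

For the token stream "the cat the dog sat" and a sequence length of 2, this produces two pairs: inputs (the, cat) and (the, dog) with targets (cat, the) and (dog, sat).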

    kernelnet::data::PTBDataLoader

    Wraps a kernelnet::data::PTBDataset for shuffled mini‑batch retrieval of sequence samples.

    
    class PTBDataLoader {
    public:
      PTBDataLoader(PTBDataset &dataset, int batch_size, bool shuffle=true);
      void reset();
      bool hasNext() const;
      pair<Tensor,Tensor> nextBatch();
    private:
      PTBDataset &dataset;
      int batch_size;
      bool shuffle;
      vector<size_t> indices;  // sample order; element type assumed to be size_t
      size_t current_index;
    };

    Example

    Using the kernelnet::data utilities to benchmark training and evaluation on the CIFAR-10 dataset:

    
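          #include "kernelnet.hpp"
          #include <chrono>
          #include <iostream>

          using namespace std::chrono; // for high_resolution_clock, duration_cast, milliseconds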
    
          void runCIFAR10Tests() {
            Device dev = CPU;
        
            int batch_size = 256;
            int num_epochs = 100;
            int num_classes = 10;
            int image_height = 32, image_width = 32, in_channels = 3;
        
            // --- Load CIFAR-10 Data ---
            CIFAR10Dataset trainDataset("data/cifar10/train", true);
            CIFAR10Dataset testDataset("data/cifar10/test", false);
            CIFAR10DataLoader trainLoader(trainDataset, batch_size, true);
            CIFAR10DataLoader testLoader(testDataset, batch_size, false);
        
            // --- Define Model Architecture ---
            // conv1: input channels 3 → 16, output dims remain 32×32
            auto conv1 = make_shared<Conv2D>(in_channels, 16, 3, 3, image_height, image_width, 1, 1, dev);
            auto pool1 = make_shared<MaxPool2D>(2, 2, batch_size, 16, image_height, image_width); // → 16×16
        
            auto conv2 = make_shared<Conv2D>(16, 32, 3, 3, image_height / 2, image_width / 2, 1, 1, dev);
            auto pool2 = make_shared<MaxPool2D>(2, 2, batch_size, 32, image_height / 2, image_width / 2); // → 8×8
        
            auto dense = make_shared<Dense>(32 * 8 * 8, num_classes, dev);
            auto softmax = make_shared<Softmax>(batch_size, num_classes);
        
            // Assemble model into a Sequential container
            shared_ptr<Sequential> model = make_shared<Sequential>(initializer_list<shared_ptr<SingleInputModule>>{
                conv1, pool1, conv2, pool2, dense, softmax});
        
            // --- Set Optimizer ---
            vector<VarPtr> params = model->parameters();
            float learning_rate = 0.01f;
            SGD optimizer(params, learning_rate);
        
            // --- Define Loss Function ---
            LossFunction loss_fn = [num_classes](const VarPtr &prediction, const Tensor &target) {
                return CrossEntropyLossFunction::apply(prediction, target, num_classes);
            };
        
            // --- Create Trainer ---
            Trainer trainer(model, optimizer, loss_fn);
        
            // --- Training Loop ---
            auto start = high_resolution_clock::now();
            for (int epoch = 0; epoch < num_epochs; epoch++) {
                float epoch_loss = 0.0f;
                int batches = 0;
        
                while (trainLoader.hasNext()) {
                    auto batch = trainLoader.nextBatch();
                    if (dev == CUDA) {
                        batch.first.toCUDA();
                        batch.second.toCUDA();
                    }
        
                    VarPtr input_var = make_shared<Variable>(batch.first, false, "input_batch");
                    VarPtr target_var = make_shared<Variable>(batch.second, false, "target_batch");
        
                    vector<VarPtr> inputs = {input_var};
                    vector<VarPtr> targets = {target_var};
                    trainer.trainEpoch(inputs, targets);
        
                    VarPtr prediction = model->forward(input_var);
                    VarPtr loss = loss_fn(prediction, batch.second);
                    float batch_loss = loss->data.sum();
        
                    epoch_loss += batch_loss;
                    batches++;
                }
        
                trainLoader.reset();
                cout << "Epoch " << epoch << " Average Loss: " << epoch_loss / batches << endl;
            }
        
            auto end = high_resolution_clock::now();
            auto duration = duration_cast<milliseconds>(end - start).count();
            cout << "Custom architecture training completed in " << (duration / 1000.0) << " seconds" << endl;
        
            // --- Evaluate on Test Set ---
            int correct = 0, total = 0;
            while (testLoader.hasNext()) {
                auto batch = testLoader.nextBatch();
        
                if (dev == CUDA) {
                    batch.first.toCUDA();
                    batch.second.toCUDA();
                }
        
                VarPtr input_var = make_shared<Variable>(batch.first, false, "test_input");
                VarPtr prediction = model->forward(input_var);
        
                vector<int> pred_labels = prediction->data.argmax(1, num_classes);
                vector<int> true_labels = batch.second.argmax(1, num_classes);
        
                for (size_t i = 0; i < pred_labels.size(); ++i) {
                    if (pred_labels[i] == true_labels[i])
                        correct++;
                    total++;
                }
            }
        
            cout << "KernelNet Test Accuracy: " << (100.0 * correct / total) << "%" << endl;
        }
        

    Using the kernelnet::data utilities to benchmark training and evaluation on the PTB dataset:

    
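    #include "kernelnet.hpp"
    #include <chrono>
    #include <cmath>

    using namespace std::chrono; // for high_resolution_clock, duration_cast, milliseconds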
    
    // Helper function: converts a tensor of token indices to a one-hot encoded tensor
    inline Tensor onehot(const Tensor &indices, int num_classes);
    int runPTBTests() {
      // --- Hyperparameters and Device Setup ---
      Device dev = CPU;
      int batch_size = 8;
      int num_epochs = 20;
      int sequence_length = 35;
      int embed_dim = 128;
      int hidden_dim = 256;
    
      // --- Load PTB Data ---
      PTBDataset trainDataset("data/ptb/ptb.train.txt", sequence_length);
      PTBDataset testDataset("data/ptb/ptb.test.txt", sequence_length);
      PTBDataLoader trainLoader(trainDataset, batch_size, true);
      PTBDataLoader testLoader(testDataset, batch_size, false);
    
      int vocab_size = trainDataset.vocab_size;
    
      // --- Build Model ---
      // Model architecture: Embedding → LSTM (unrolled) → Dense → Softmax.
      auto embedding = make_shared<Embedding>(vocab_size, embed_dim, dev);
      auto lstm = make_shared<LSTM>(batch_size, sequence_length, embed_dim, hidden_dim, dev);
      auto dense = make_shared<Dense>(hidden_dim, vocab_size, dev);
      auto softmax = make_shared<Softmax>(batch_size * sequence_length, vocab_size);
    
      // Assemble the model using the kernelnet::nn::Sequential container.
      auto model = make_shared<Sequential>(initializer_list<shared_ptr<SingleInputModule>>{
          embedding,
          lstm,
          dense,
          softmax});
    
      // --- Setup Optimizer ---
      vector<VarPtr> params = model->parameters();
      float learning_rate = 0.01f;
      SGD optimizer(params, learning_rate);
    
      // --- Define Loss Function Lambda ---
      LossFunction loss_fn = [vocab_size](const VarPtr &prediction, const Tensor &target) {
          Tensor onehot_target = onehot(target, vocab_size);
          return CrossEntropyLossFunction::apply(prediction, onehot_target, vocab_size);
      };
    
      // --- Create Trainer ---
      // Trainer accepts model, optimizer, and the loss function.
      Trainer trainer(model, optimizer, loss_fn);
    
      // --- Training Loop ---
      auto start = high_resolution_clock::now();
      for (int epoch = 0; epoch < num_epochs; epoch++) {
          float epoch_loss = 0.0f;
          int batches = 0;
          while (trainLoader.hasNext()) {
              auto batch = trainLoader.nextBatch();
    
              if (dev == CUDA) {
                  batch.first.toCUDA();
                  batch.second.toCUDA();
              }
    
              // Wrap the input and target batches in Variables.
              VarPtr input_var = make_shared<Variable>(batch.first, true, "input_batch");
              VarPtr target_var = make_shared<Variable>(batch.second, false, "target_batch");
    
              // Trainer::trainEpoch() takes a vector of input Variables and a vector of target Variables.
              vector<VarPtr> inputs = {input_var};
              vector<VarPtr> targets = {target_var};
    
              trainer.trainEpoch(inputs, targets);
    
              // For logging, compute loss separately:
              VarPtr prediction = model->forward(input_var);
    
              VarPtr loss = loss_fn(prediction, batch.second);
    
              float batch_loss = loss->data.sum();
              epoch_loss += batch_loss;
              batches++;
          }
          trainLoader.reset();
          cout << "Epoch " << epoch << " Average Loss: " << (epoch_loss / batches) << endl;
      }
      auto end = high_resolution_clock::now();
      auto duration = duration_cast<milliseconds>(end - start).count();
      cout << "LSTM training completed in " << (duration / 1000.0) << " seconds." << endl;
    
      // --- Evaluation: Compute Perplexity on the Held-Out Set (ptb.test.txt) ---
      float total_loss = 0.0f;
      int total_tokens = 0;
      while (testLoader.hasNext()) {
          auto batch = testLoader.nextBatch();
          if (dev == CUDA) {
              batch.first.toCUDA();
              batch.second.toCUDA();
          }
          VarPtr input_var = make_shared<Variable>(batch.first, false, "valid_input");
          VarPtr prediction = model->forward(input_var);
          VarPtr loss = loss_fn(prediction, batch.second);
          total_loss += loss->data.sum();
          total_tokens += batch.second.size();
      }
      float avg_loss = total_loss / total_tokens;
      float perplexity = exp(avg_loss);
      cout << "Validation Perplexity: " << perplexity << endl;
      return 0;
    }