Overview

KernelNet is a neural‑network framework built from the ground up in C++ and CUDA. It provides:

Custom Tensor & Autograd: Multi‑dimensional arrays with element‑wise ops, matrix multiply/transpose, broadcasting, reductions, argmax, and built‑in automatic differentiation for gradient back‑propagation on both CPU and GPU.
Neural Building Blocks: Layers for feed‑forward (Dense), convolutional (Conv2D), pooling (MaxPool2D), recurrent (LSTM), and embedding operations.
Activation Functions: ReLU, Sigmoid, Tanh, and Softmax modules that integrate into customized models.
Optimization & Loss: A simple SGD optimizer (with optional gradient clipping), plus MSE and Cross‑Entropy loss functions.
Hardware Agnostic: Write once—run on CPU or accelerate with CUDA-enabled GPUs.
Benchmarking Pipelines: Ready‑to‑use data loaders and example scripts for CIFAR‑10 and Penn Treebank to measure both performance and accuracy.

To obtain KernelNet, see the KernelNet releases page, which contains the source code, the pre-built libraries kernelnet.dll and kernelnet.lib, and the /include folder.

Note: For brevity and clarity, all examples in this documentation use using namespace std;. This allows us to write vector, string, pair, etc., without the std:: prefix.

kernelnet::tensor::Tensor

The kernelnet::tensor::Tensor class represents a multi-dimensional array that can live on either the CPU or a CUDA (GPU) device. It supports element-wise addition, subtraction, and multiplication; matrix multiplication and transpose; scalar multiplication; broadcast addition; summation; and argmax. Together these operations provide the foundation for building neural networks.

Constructor

The kernelnet::tensor::Tensor class provides two constructors:

Default Constructor
```
Tensor::Tensor() : _size(0), _data_host(nullptr), _data_device(nullptr), _device(CPU) {}
```
Creates a zero-sized tensor with no allocated memory. The device defaults to CPU.
Parameterized Constructor
```
Tensor::Tensor(size_t size, Device device);
```
Constructs a tensor with the specified number of elements and target device. It allocates host memory immediately and, if the device is CUDA, also allocates device memory and copies the host data.

Methods

Utility Methods

fill(float val)
```
void Tensor::fill(float val);
```
Fills the tensor with a constant value. The host memory is updated and, if on CUDA, the data is copied to the device.
=(const Tensor &other)
```
Tensor &operator=(const Tensor &other);
```
Frees the existing memory and deep copies data from another tensor.
print() const
```
void Tensor::print() const;
```
Prints the tensor’s values to the console. For a CUDA tensor, its data is first copied to the host.
data()
```
float* Tensor::data();
const float* Tensor::data() const;
```
Returns a pointer to the underlying memory. Depending on the device, it will reference host or device memory.
size() const
```
size_t Tensor::size() const;
```
Returns the total number of elements in the tensor.
device() const
```
Device Tensor::device() const;
```
Retrieves the current device (either CPU or CUDA) where the tensor is stored.
toCUDA() / toCPU()
```
void Tensor::toCUDA();
void Tensor::toCPU();
```
toCUDA() transfers the tensor to GPU memory (allocating device memory as needed), while toCPU() transfers the data back to host memory and frees any allocated CUDA memory.
free()
```
void Tensor::free();
```
Deletes the tensor's host memory and frees device memory if allocated.

Arithmetic and Matrix Operations

add(const Tensor &a, const Tensor &b)
```
static Tensor Tensor::add(const Tensor &a, const Tensor &b);
```
Performs element-wise addition between two tensors. Assumes both tensors have the same size and reside on the same device.
subtract(const Tensor &a, const Tensor &b)
```
static Tensor Tensor::subtract(const Tensor &a, const Tensor &b);
```
Performs element-wise subtraction between two tensors. Assumes both tensors have the same size and reside on the same device.
multiply(const Tensor &a, const Tensor &b)
```
static Tensor Tensor::multiply(const Tensor &a, const Tensor &b);
```
Computes element-wise multiplication between two tensors. Assumes both tensors have the same size and reside on the same device.
broadcast_add(const Tensor &a, const Tensor &b)
```
static Tensor Tensor::broadcast_add(const Tensor &a, const Tensor &b);
```
If the tensors are of equal size, performs regular element-wise addition. If one tensor’s size evenly divides the other’s, the smaller tensor is broadcast (repeated) across the larger one.
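The broadcasting rule can be sketched on flat arrays. This is a minimal illustration, assuming the smaller tensor is tiled across the larger one (the way a bias vector is repeated for every row of a batch). `broadcast_add_sketch` is a hypothetical helper and not part of KernelNet.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: adds two flat arrays, tiling the smaller one
// across the larger one when the sizes differ.
std::vector<float> broadcast_add_sketch(const std::vector<float> &a,
                                        const std::vector<float> &b) {
    const std::vector<float> &larger = (a.size() >= b.size()) ? a : b;
    const std::vector<float> &smaller = (a.size() >= b.size()) ? b : a;
    // The smaller size must evenly divide the larger one.
    assert(!smaller.empty() && larger.size() % smaller.size() == 0);

    std::vector<float> out(larger.size());
    for (std::size_t i = 0; i < larger.size(); ++i) {
        out[i] = larger[i] + smaller[i % smaller.size()];
    }
    return out;
}
```

With equal sizes the modulo index is simply `i`, so the same loop covers the plain element-wise case.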
matmul(const Tensor &a, const Tensor &b, int M, int K, int N)
```
static Tensor Tensor::matmul(const Tensor &a, const Tensor &b, int M, int K, int N);
```
Performs matrix multiplication between tensor a (of shape MxK) and tensor b (of shape KxN).
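Because a Tensor is a flat buffer, the M, K, and N arguments are what give it a matrix shape. The sketch below shows the same computation on plain vectors, assuming row-major storage (the layout is not stated above, so treat it as an assumption). `matmul_sketch` is a hypothetical reference function, not the library's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reference: C (M x N) = A (M x K) * B (K x N),
// with every matrix stored as a flat row-major array.
std::vector<float> matmul_sketch(const std::vector<float> &a,
                                 const std::vector<float> &b,
                                 int M, int K, int N) {
    assert(a.size() == static_cast<std::size_t>(M) * K);
    assert(b.size() == static_cast<std::size_t>(K) * N);
    std::vector<float> c(static_cast<std::size_t>(M) * N, 0.0f);
    for (int i = 0; i < M; ++i) {
        for (int k = 0; k < K; ++k) {
            for (int j = 0; j < N; ++j) {
                c[i * N + j] += a[i * K + k] * b[k * N + j];
            }
        }
    }
    return c;
}
```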
transpose(const Tensor &a, int rows, int cols)
```
static Tensor Tensor::transpose(const Tensor &a, int rows, int cols);
```
Returns the transposed tensor by treating it as a 2D matrix.
scalar_multiply(const Tensor &a, float scalar)
```
static Tensor Tensor::scalar_multiply(const Tensor &a, float scalar);
```
Multiplies every element in the tensor by a scalar value.
argmax() const
```
int Tensor::argmax() const;
```
Returns the index of the maximum element in the flattened tensor.
argmax(int axis, int dim_size) const
```
vector<int> Tensor::argmax(int axis, int dim_size) const;
```
Treats the flat tensor as a 2D array of shape (batch_size, dim_size) and returns, for each row, the index of its maximum element (argmax along axis 1).
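A row-wise argmax over a flat buffer can be sketched as follows. This is an illustration on plain vectors, assuming row-major storage. `argmax_rows_sketch` is a hypothetical name, and ties resolve to the first maximum here, which may differ from the library.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: for a flat row-major array of shape
// (batch_size, dim_size), return the index of the largest value in each row.
std::vector<int> argmax_rows_sketch(const std::vector<float> &data, int dim_size) {
    assert(dim_size > 0 && data.size() % static_cast<std::size_t>(dim_size) == 0);
    const std::size_t batch_size = data.size() / static_cast<std::size_t>(dim_size);
    std::vector<int> result(batch_size, 0);
    for (std::size_t b = 0; b < batch_size; ++b) {
        const float *row = data.data() + b * dim_size;
        for (int j = 1; j < dim_size; ++j) {
            if (row[j] > row[result[b]]) {
                result[b] = j;  // strictly greater, so the first maximum wins
            }
        }
    }
    return result;
}
```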
sum() const
```
float Tensor::sum() const;
```
Computes and returns the sum of all elements in the tensor.

Example

The following code snippet demonstrates a basic example using CUDA. Two tensors are created, filled with constant values, transferred to CUDA, multiplied element-wise, and then the result is transferred back to CPU for display.


  #include "kernelnet.hpp"

  // Create two tensors on CPU, fill them, and then move them to CUDA.
  Tensor a(10, CPU);
  Tensor b(10, CPU);
  a.fill(2.0f);
  b.fill(3.0f);
  a.toCUDA();
  b.toCUDA();
  
  // Compute element-wise multiplication on CUDA.
  Tensor result = Tensor::multiply(a, b);
  
  result.toCPU();
  result.print();

kernelnet::autograd

The kernelnet::autograd module forms the backbone of KernelNet’s automatic differentiation engine. It enables gradient-based optimization by linking together differentiable operations. The two central classes are kernelnet::autograd::Variable and kernelnet::autograd::Function. While kernelnet::autograd::Variable encapsulates tensor data and manages gradient information, kernelnet::autograd::Function serves as an abstract base for all differentiable operations.

Autograd Classes

A dynamic computational graph is built during the forward pass. Every time a differentiable operation (a function derived from kernelnet::autograd::Function) is executed via its static kernelnet::autograd::Function::apply method, a new kernelnet::autograd::Variable node is created. These nodes encapsulate both the computed tensor value and a pointer (via the creator field) to the operation that produced them. Such intermediate nodes, created for operations like addition, multiplication, and slicing, are only needed during backpropagation to compute gradients. We use the type aliases using VarPtr = shared_ptr<Variable> and using FuncPtr = shared_ptr<Function> so that these temporary nodes are automatically deallocated once they are no longer referenced—typically after the backward pass has propagated gradients through the graph.

kernelnet::autograd::Variable

Constructor
```
Variable::Variable(const Tensor &data, bool requires_grad = false, const string &name = "");
```
Creates a new variable that wraps a given tensor. If requires_grad is set to true, a gradient tensor (of matching size and device) is created and initialized to zero.
set_creator(const FuncPtr &func)
```
void Variable::set_creator(const FuncPtr &func);
```
Assigns the creator function (i.e., the operation that produced this variable).
backward(const Tensor &grad_output)
```
void Variable::backward(const Tensor &grad_output);
```
Initiates the backpropagation process. It accumulates gradients, and once all contributions are received, it propagates the gradient further by calling the backward method of its creator.
detach()
```
VarPtr Variable::detach();
```
Returns a detached copy of the variable that does not track gradients.

kernelnet::autograd::Function

Abstract Base Class
Serves as the base for all differentiable operations.
backward(const Tensor &grad_output)
```
virtual vector<Tensor> Function::backward(const Tensor &grad_output) = 0;
```
A pure virtual method that must be implemented by derived classes. It receives the upstream gradient and computes gradients for each input.

Child Functions

The autograd module provides several built‐in differentiable function implementations that derive from the base kernelnet::autograd::Function class. Each derived function implements a static apply method that computes its output (returning a kernelnet::autograd::Variable) while saving its inputs and parameters for the backward pass. The corresponding backward methods receive the gradient from the next layer and compute gradients for each input.

kernelnet::autograd::AddFunction
- apply(const VarPtr &a, const VarPtr &b)
```
static VarPtr AddFunction::apply(const VarPtr &a, const VarPtr &b);
```
  Computes z = a + b by using Tensor::broadcast_add. It increments the pending gradient count of each input if gradients are required, saves the inputs, and produces a new output variable with its creator set.
- backward(const Tensor &grad_output)
```
virtual vector<Tensor> AddFunction::backward(const Tensor &grad_output);
```
  Propagates the gradient by either duplicating the incoming gradient (if inputs are of equal size) or summing gradients along the broadcasted dimensions.
kernelnet::autograd::SubtractFunction
- apply(const VarPtr &a, const VarPtr &b)
```
static VarPtr SubtractFunction::apply(const VarPtr &a, const VarPtr &b);
```
  Computes the element‐wise difference z = a - b and saves both input variables.
- backward(const Tensor &grad_output)
```
virtual vector<Tensor> SubtractFunction::backward(const Tensor &grad_output);
```
  Propagates the incoming gradient unchanged for the first input; for the second input, the gradient is multiplied by -1. Special handling is provided if the second input does not require gradients.
kernelnet::autograd::MultiplyFunction
- apply(const VarPtr &a, const VarPtr &b)
```
static VarPtr MultiplyFunction::apply(const VarPtr &a, const VarPtr &b);
```
  Computes the element‐wise product z = a * b while saving both inputs.
- backward(const Tensor &grad_output)
```
virtual vector<Tensor> MultiplyFunction::backward(const Tensor &grad_output);
```
  Implements the chain rule by computing grad_a = grad_output * b and grad_b = grad_output * a.
kernelnet::autograd::MatMulFunction
- apply(const VarPtr &a, const VarPtr &b, int M, int K, int N)
```
static VarPtr MatMulFunction::apply(const VarPtr &a, const VarPtr &b, int M, int K, int N);
```
  Performs matrix multiplication between A (with shape M×K) and B (with shape K×N) via Tensor::matmul, and saves the inputs along with matrix dimensions for the backward pass.
- backward(const Tensor &grad_output)
```
virtual vector<Tensor> MatMulFunction::backward(const Tensor &grad_output);
```
  Computes gradients using transposition: grad_a = grad_output × B^T and grad_b = A^T × grad_output.
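The two gradient formulas can be checked on plain vectors. The sketch below assumes row-major storage. `matmul_ref`, `transpose_ref`, and `matmul_backward_sketch` are hypothetical helpers written for this illustration, not KernelNet functions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<float>;  // flat row-major storage (assumed layout)

// C (M x N) = A (M x K) * B (K x N)
Matrix matmul_ref(const Matrix &a, const Matrix &b, int M, int K, int N) {
    assert(a.size() == static_cast<std::size_t>(M) * K);
    assert(b.size() == static_cast<std::size_t>(K) * N);
    Matrix c(static_cast<std::size_t>(M) * N, 0.0f);
    for (int i = 0; i < M; ++i)
        for (int k = 0; k < K; ++k)
            for (int j = 0; j < N; ++j)
                c[i * N + j] += a[i * K + k] * b[k * N + j];
    return c;
}

// Returns the (cols x rows) transpose of a (rows x cols) matrix.
Matrix transpose_ref(const Matrix &a, int rows, int cols) {
    Matrix t(a.size());
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c)
            t[c * rows + r] = a[r * cols + c];
    return t;
}

struct MatMulGrads {
    Matrix grad_a;  // M x K
    Matrix grad_b;  // K x N
};

MatMulGrads matmul_backward_sketch(const Matrix &a, const Matrix &b,
                                   const Matrix &grad_output,
                                   int M, int K, int N) {
    // grad_a (M x K) = grad_output (M x N) * B^T (N x K)
    Matrix grad_a = matmul_ref(grad_output, transpose_ref(b, K, N), M, N, K);
    // grad_b (K x N) = A^T (K x M) * grad_output (M x N)
    Matrix grad_b = matmul_ref(transpose_ref(a, M, K), grad_output, K, M, N);
    return {grad_a, grad_b};
}
```

Note how the shapes force the order of the operands: grad_a must match A (M×K) and grad_b must match B (K×N).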
kernelnet::autograd::SumFunction
- apply(const VarPtr &input)
```
static VarPtr SumFunction::apply(const VarPtr &input);
```
  Reduces all elements of the input tensor to a scalar sum. The input tensor is saved along with its size.
- backward(const Tensor &grad_output)
```
virtual vector<Tensor> SumFunction::backward(const Tensor &grad_output);
```
  Propagates the scalar gradient to every element of the input by replicating it—using a CUDA kernel (fill_kernel) if on GPU.
kernelnet::autograd::LogFunction
- apply(const VarPtr &input)
```
static VarPtr LogFunction::apply(const VarPtr &input);
```
  Applies the natural logarithm element-wise (adding a small epsilon for numerical stability) and saves the input.
- backward(const Tensor &grad_output)
```
virtual vector<Tensor> LogFunction::backward(const Tensor &grad_output);
```
  Computes the gradient using the derivative 1/(x + epsilon) multiplied element‐wise with the incoming gradient.
kernelnet::autograd::MSEFunction
Note: This is an abstract class and cannot be instantiated directly. It provides a static helper function to construct the MSE loss computation using other differentiable operations.
- apply(const VarPtr &prediction, const Tensor &target)
```
static VarPtr MSEFunction::apply(const VarPtr &prediction, const Tensor &target);
```
  Computes the Mean Squared Error (MSE) loss by performing the following steps:
  1. Subtracts the target tensor from the prediction variable.
  2. Squares the element-wise difference.
  3. Sums all squared differences into a scalar.
  4. Scales the result by the reciprocal of the number of elements.
  This function returns a scalar kernelnet::autograd::Variable representing the MSE loss. It internally builds a computation graph using standard differentiable functions including kernelnet::autograd::SubtractFunction, kernelnet::autograd::MultiplyFunction, and kernelnet::autograd::SumFunction.
kernelnet::autograd::CrossEntropyLossFunction
Note: This is an abstract class and cannot be instantiated directly. It defines a static utility method that constructs a cross-entropy loss computation graph using standard autograd functions.
- apply(const VarPtr &prediction, const Tensor &target, int num_classes)
```
static VarPtr CrossEntropyLossFunction::apply(const VarPtr &prediction, const Tensor &target, int num_classes);
```
  Constructs the cross-entropy loss as a computation graph using the following steps:
  1. Applies a logarithm to the predictions using kernelnet::autograd::LogFunction.
  2. Multiplies the log-predictions with the target tensor (often one-hot encoded) using kernelnet::autograd::MultiplyFunction.
  3. Sums the result using kernelnet::autograd::SumFunction.
  4. Scales the summed loss:
    - Divides by batch size if num_classes > 0.
    - Otherwise, multiplies by -1.
  The returned kernelnet::autograd::Variable contains a scalar value representing the cross-entropy loss. The actual gradient computation is handled by the components (kernelnet::autograd::LogFunction, kernelnet::autograd::MultiplyFunction, kernelnet::autograd::SumFunction) used to construct the graph.
kernelnet::autograd::SliceFunction
- apply(const VarPtr &input, int batch_size, int start, int end)
```
static VarPtr SliceFunction::apply(const VarPtr &input, int batch_size, int start, int end);
```
  Interprets the input tensor as a 2D array (shape: [batch_size, total_width]) and extracts columns in the interval [start, end). It saves the input and slicing parameters and returns a new variable containing the sliced tensor.
- backward(const Tensor &grad_output)
```
virtual vector<Tensor> SliceFunction::backward(const Tensor &grad_output);
```
  Maps the gradient from the sliced output back to the corresponding locations in the full input tensor, setting gradients outside the slice to zero.
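The column slice and its gradient scatter can be sketched on flat arrays. This is an illustration under the assumption of row-major [batch_size, total_width] storage. Both function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Forward: copy columns [start, end) of every row into a narrower array.
std::vector<float> slice_forward_sketch(const std::vector<float> &input,
                                        int batch_size, int start, int end) {
    assert(batch_size > 0 && input.size() % static_cast<std::size_t>(batch_size) == 0);
    const int total_width = static_cast<int>(input.size()) / batch_size;
    assert(0 <= start && start < end && end <= total_width);
    const int width = end - start;
    std::vector<float> out(static_cast<std::size_t>(batch_size) * width);
    for (int b = 0; b < batch_size; ++b)
        for (int j = 0; j < width; ++j)
            out[b * width + j] = input[b * total_width + start + j];
    return out;
}

// Backward: place each gradient value back at its source column;
// every position outside the slice keeps a zero gradient.
std::vector<float> slice_backward_sketch(const std::vector<float> &grad_output,
                                         int batch_size, int total_width,
                                         int start, int end) {
    const int width = end - start;
    assert(grad_output.size() == static_cast<std::size_t>(batch_size) * width);
    std::vector<float> grad_input(static_cast<std::size_t>(batch_size) * total_width, 0.0f);
    for (int b = 0; b < batch_size; ++b)
        for (int j = 0; j < width; ++j)
            grad_input[b * total_width + start + j] = grad_output[b * width + j];
    return grad_input;
}
```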
kernelnet::autograd::ConcatFunction
- apply(const vector<VarPtr> &inputs)
```
static VarPtr ConcatFunction::apply(const vector<VarPtr> &inputs);
```
  Concatenates a list of input variables into a single output tensor by recording their individual sizes and copying data sequentially from each input.
- backward(const Tensor &grad_output)
```
virtual vector<Tensor> ConcatFunction::backward(const Tensor &grad_output);
```
  Splits the incoming gradient tensor into segments corresponding to each original input's size, returning a vector of gradient tensors.

Example

The following snippet demonstrates using the Mean Squared Error (MSE) loss with KernelNet’s autograd system. Two tensors are created on the CPU (one for prediction and one for target), and the prediction is wrapped in a Variable that requires gradients. The MSE loss is computed, and an upstream gradient of 1 is used to backpropagate the gradients.


#include "kernelnet.hpp"

// Create prediction and target tensors (size 5) on CPU.
Tensor msePred(5, CPU);
Tensor mseTarget(5, CPU);
msePred.fill(2.0f);     
mseTarget.fill(3.0f);   

// Wrap the prediction in a Variable that tracks gradients.
auto varPred = make_shared<Variable>(msePred, true);

// Compute MSE loss: loss = mean((prediction - target)^2).
auto mseLoss = MSEFunction::apply(varPred, mseTarget);

// Create an upstream gradient (of ones) for the backward pass.
Tensor gradOutput(mseLoss->data.size(), CPU);
gradOutput.fill(1.0f);

// Perform backpropagation.
mseLoss->backward(gradOutput);

varPred->grad.print();

kernelnet::nn::Module

The kernelnet::nn::Module class defines the abstract base interface for all neural network layers or components in KernelNet. It standardizes how modules perform their forward computation, how their parameters are accessed for optimization, and how gradients are reset before each optimizer step.

Every custom layer (such as Dense, Conv2D, or LSTM layers) should inherit from kernelnet::nn::Module and override the kernelnet::nn::Module::forward method. Additionally, modules can override the parameters method to expose their trainable parameters.

Methods

forward(const vector<VarPtr> &inputs)
```
virtual vector<VarPtr> Module::forward(const vector<VarPtr> &inputs) = 0;
```
Performs the forward computation of the module. Derived classes must override this method to define how input variables are transformed.
parameters()
```
virtual vector<VarPtr> Module::parameters();
```
Returns all learnable parameters of the module. By default, it returns an empty vector. Derived modules that have trainable parameters should override this method.
zero_grad()
```
virtual void Module::zero_grad();
```
Zeros out the gradients for all parameters in the module.

kernelnet::nn::SingleInputModule

The kernelnet::nn::SingleInputModule class inherits from kernelnet::nn::Module and serves as an abstract base for neural network layers or components that operate on exactly one input and produce one output.

Methods

forward(const VarPtr &input)
```
virtual VarPtr SingleInputModule::forward(const VarPtr &input) = 0;
```
This abstract method must be implemented by any subclass.
forward(const vector<VarPtr> &inputs)
```
vector<VarPtr> SingleInputModule::forward(const vector<VarPtr> &inputs) override;
```
This method wraps the single-input forward function. It verifies that exactly one input is provided.
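The relationship between the two forward overloads can be sketched with stand-in types. Everything below is hypothetical: `Variable` is reduced to a single float, `Doubler` is an invented layer, and reporting a wrong input count with an exception is an assumption, since the text above only says the count is verified.

```cpp
#include <memory>
#include <stdexcept>
#include <vector>

struct Variable { float value; };  // stand-in for the real Variable
using VarPtr = std::shared_ptr<Variable>;

struct Module {
    virtual ~Module() = default;
    virtual std::vector<VarPtr> forward(const std::vector<VarPtr> &inputs) = 0;
};

struct SingleInputModule : Module {
    // Subclasses implement only this single-input overload.
    virtual VarPtr forward(const VarPtr &input) = 0;

    // The vector overload checks the input count and delegates.
    std::vector<VarPtr> forward(const std::vector<VarPtr> &inputs) override {
        if (inputs.size() != 1) {
            throw std::invalid_argument("expected exactly one input");
        }
        return { forward(inputs[0]) };
    }
};

// Invented example layer: doubles its input.
struct Doubler : SingleInputModule {
    using SingleInputModule::forward;  // keep the vector overload visible
    VarPtr forward(const VarPtr &input) override {
        return std::make_shared<Variable>(Variable{input->value * 2.0f});
    }
};
```

The using-declaration matters in C++: without it, the override in the subclass would hide the inherited vector overload from callers that hold a `Doubler` directly.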

kernelnet::nn::Sequential

The kernelnet::nn::Sequential container module provides a way to stack layers linearly. It inherits from kernelnet::nn::SingleInputModule, and each submodule in a Sequential container accepts a single VarPtr as input and returns a single VarPtr as output.

Constructor

The kernelnet::nn::Sequential container provides two constructors:

Default Constructor
```
Sequential::Sequential() : training(true) {}
```
Constructs an empty Sequential container, defaulting to training mode.
Parameterized Constructor
```
Sequential::Sequential(initializer_list<shared_ptr<SingleInputModule>> modules)
      : layers(modules), training(true) {}
```
Initializes the Sequential container with a list of shared pointers to kernelnet::nn::SingleInputModule objects. The provided modules are stored in order, and the container is set to training mode.

Methods

forward(const VarPtr &input)
```
virtual VarPtr Sequential::forward(const VarPtr &input) override;
```
Executes the forward pass by sequentially feeding the input through each submodule. It returns the final output variable.
parameters()
```
virtual vector<VarPtr> Sequential::parameters() override;
```
Returns a flattened list of all learnable parameters from each submodule.
train()
```
void Sequential::train();
```
Sets all contained submodules into training mode.
eval()
```
void Sequential::eval();
```
Sets all contained submodules to evaluation mode.

Example

Two layers are added to a Sequential container, and a forward pass is executed.


  #include "kernelnet.hpp"

  auto layer1 = make_shared<Dense>(input_dim, hidden_dim, device);
  auto layer2 = make_shared<Dense>(hidden_dim, output_dim, device);
  Sequential model = { layer1, layer2 };
  
  VarPtr output = model.forward(input);

kernelnet::nn::Sigmoid

The kernelnet::nn::Sigmoid module implements the sigmoid activation function, defined as:

sigmoid(x) = 1 / (1 + exp(-x))

It is implemented by kernelnet::nn::SigmoidFunction, an autograd-compatible function derived from kernelnet::autograd::Function that computes both the forward pass and the gradients during backpropagation. The user-facing kernelnet::nn::Sigmoid module wraps this function so that it can be used as a layer in custom models.

Sigmoid Function

The kernelnet::nn::SigmoidFunction class, inherited from kernelnet::autograd::Function, provides the autograd-compatible implementation of the sigmoid activation.

In the forward pass, the input tensor is transformed into its sigmoid representation. During the backward pass, it computes the gradient based on the derivative: y * (1 - y), where y is the sigmoid output.

Methods:

apply(const VarPtr &input)
```
static VarPtr SigmoidFunction::apply(const VarPtr &input);
```
Applies sigmoid to the input variable and builds the autograd graph.
backward(const Tensor &grad_output)
```
vector<Tensor> SigmoidFunction::backward(const Tensor &grad_output);
```
Computes the gradient with respect to the input using the formula: grad_output * y * (1 - y).
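The forward and backward formulas can be written out on plain vectors. This is a minimal sketch with hypothetical helper names. It caches nothing itself, so the caller passes the forward output y back in.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Forward: y_i = 1 / (1 + exp(-x_i))
std::vector<float> sigmoid_forward_sketch(const std::vector<float> &x) {
    std::vector<float> y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        y[i] = 1.0f / (1.0f + std::exp(-x[i]));
    }
    return y;
}

// Backward: grad_input_i = grad_output_i * y_i * (1 - y_i),
// where y is the output saved from the forward pass.
std::vector<float> sigmoid_backward_sketch(const std::vector<float> &grad_output,
                                           const std::vector<float> &y) {
    assert(grad_output.size() == y.size());
    std::vector<float> grad_input(y.size());
    for (std::size_t i = 0; i < y.size(); ++i) {
        grad_input[i] = grad_output[i] * y[i] * (1.0f - y[i]);
    }
    return grad_input;
}
```

Reusing y instead of recomputing exp(-x) is why the output, rather than the input, is the natural thing to save for the backward pass.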

Sigmoid Module

The kernelnet::nn::Sigmoid class is a user-facing module that inherits from kernelnet::nn::SingleInputModule. It acts as a wrapper for the kernelnet::nn::SigmoidFunction.

In its forward pass, the kernelnet::nn::Sigmoid module calls kernelnet::nn::SigmoidFunction::apply on the input variable.


  Sigmoid::Sigmoid() {} // Constructor
  
  VarPtr Sigmoid::forward(const VarPtr &input) {
      return SigmoidFunction::apply(input);
  }

kernelnet::nn::Softmax

The kernelnet::nn::Softmax module implements the softmax activation function, which normalizes raw input scores into a probability distribution. It is defined as:

softmax(x_i) = exp(x_i) / sum_j exp(x_j)

The module is built upon an autograd-compatible function, kernelnet::nn::SoftmaxFunction, which handles both the forward computation (applying softmax) and the backward computation (calculating gradients for backpropagation). The user-facing kernelnet::nn::Softmax class wraps this function so that it can be used as a layer in custom models.

Softmax Function

The kernelnet::nn::SoftmaxFunction class, inherited from kernelnet::autograd::Function, provides the autograd-compatible implementation of the softmax activation. It computes the forward pass by normalizing input scores along the last dimension of each sample, and caches the output for use in the backward pass.

During the backward pass, it calculates the gradient with respect to the input using the formula:

dL/dx_i = y_i * (dL/dy_i - sum_j(dL/dy_j * y_j)), where y = softmax(x).

Methods:

apply(const VarPtr &input, int batch_size, int num_classes)
```
static VarPtr SoftmaxFunction::apply(const VarPtr &input, int batch_size, int num_classes);
```
Applies softmax to the input variable with the specified batch_size and num_classes, constructs the autograd graph, and caches the softmax output for the backward pass.
backward(const Tensor &grad_output)
```
vector<Tensor> SoftmaxFunction::backward(const Tensor &grad_output);
```
Computes the gradient with respect to the input tensor using the formula: grad_input = y * (grad_output - sum(grad_output * y)), where y is the cached softmax output.
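A per-row softmax and its gradient can be sketched as follows, assuming a row-major [batch_size, num_classes] layout. The function names are hypothetical, and subtracting the row maximum before exponentiating is a common stability trick assumed here, not something the text above confirms about KernelNet.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

std::vector<float> softmax_forward_sketch(const std::vector<float> &x,
                                          int batch_size, int num_classes) {
    assert(num_classes > 0);
    assert(x.size() == static_cast<std::size_t>(batch_size) * num_classes);
    std::vector<float> y(x.size());
    for (int b = 0; b < batch_size; ++b) {
        const float *row = x.data() + b * num_classes;
        float *out = y.data() + b * num_classes;
        // Shifting by the row maximum avoids overflow in exp() and
        // leaves the normalized result unchanged.
        const float row_max = *std::max_element(row, row + num_classes);
        float denom = 0.0f;
        for (int j = 0; j < num_classes; ++j) {
            out[j] = std::exp(row[j] - row_max);
            denom += out[j];
        }
        for (int j = 0; j < num_classes; ++j) {
            out[j] /= denom;
        }
    }
    return y;
}

// grad_input_i = y_i * (grad_output_i - sum_j(grad_output_j * y_j)), per row.
std::vector<float> softmax_backward_sketch(const std::vector<float> &grad_output,
                                           const std::vector<float> &y,
                                           int batch_size, int num_classes) {
    assert(grad_output.size() == y.size());
    assert(y.size() == static_cast<std::size_t>(batch_size) * num_classes);
    std::vector<float> grad_input(y.size());
    for (int b = 0; b < batch_size; ++b) {
        const std::size_t base = static_cast<std::size_t>(b) * num_classes;
        float dot = 0.0f;
        for (int j = 0; j < num_classes; ++j) {
            dot += grad_output[base + j] * y[base + j];
        }
        for (int j = 0; j < num_classes; ++j) {
            grad_input[base + j] = y[base + j] * (grad_output[base + j] - dot);
        }
    }
    return grad_input;
}
```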

Softmax Module

The kernelnet::nn::Softmax class is a user-facing module that inherits from kernelnet::nn::SingleInputModule. It acts as a wrapper for kernelnet::nn::SoftmaxFunction.

In its forward pass, the kernelnet::nn::Softmax module invokes SoftmaxFunction::apply, passing the input variable along with the defined batch_size and num_classes, to compute the softmax activation.


    Softmax::Softmax(int batch_size, int num_classes)
        : batch_size(batch_size), num_classes(num_classes) {}

    VarPtr Softmax::forward(const VarPtr &input) {
        return SoftmaxFunction::apply(input, batch_size, num_classes);
    }

kernelnet::nn::Tanh

The kernelnet::nn::Tanh module implements the hyperbolic tangent (tanh) activation function, defined as:

tanh(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x))

This module integrates with the autograd system by leveraging the kernelnet::nn::TanhFunction class, which handles both the forward computation and the gradient calculation needed during backpropagation.

Tanh Function

The kernelnet::nn::TanhFunction class provides the autograd-compatible implementation of the tanh activation.

In the forward pass, the input tensor is transformed element-wise using: tanh(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x)). The output is stored for later use in the backward pass.

During the backward pass, the gradient is computed with the derivative: dL/dx = dL/dy * (1 - tanh(x)^2).

Methods:

apply(const VarPtr &input)
```
static VarPtr TanhFunction::apply(const VarPtr &input);
```
Applies the tanh function to the input variable and builds the autograd graph.
backward(const Tensor &grad_output)
```
vector<Tensor> TanhFunction::backward(const Tensor &grad_output);
```
Computes the gradient with respect to the input using the formula: grad_output * (1 - tanh(x)^2).

Tanh Module

The kernelnet::nn::Tanh class serves as the user-facing module that encapsulates the functionality of TanhFunction, inheriting from kernelnet::nn::SingleInputModule.

In its forward pass, the Tanh module calls TanhFunction::apply on the input variable to perform the tanh activation.


Tanh::Tanh() {} // Constructor

VarPtr Tanh::forward(const VarPtr &input) {
    return TanhFunction::apply(input);
}

kernelnet::nn::ReLU

The kernelnet::nn::ReLU module implements the Rectified Linear Unit (ReLU) activation function, defined as:

ReLU(x) = max(0, x)

This module integrates with the autograd system by utilizing the kernelnet::nn::ReLUFunction class, which manages both the forward computation and the backward gradient propagation during backpropagation.

ReLU Function

The kernelnet::nn::ReLUFunction class, inherited from kernelnet::autograd::Function, provides an autograd-compatible implementation of the ReLU activation. It computes the forward pass by applying: ReLU(x) = (x > 0) ? x : 0 element-wise on the input and caches the output for use in the backward pass.

For the backward pass, the gradient is computed only for the positive elements of the input using: grad_in = grad_output * (x > 0 ? 1 : 0).

Methods:

apply(const VarPtr &input)
```
static VarPtr ReLUFunction::apply(const VarPtr &input);
```
Applies the ReLU activation function to the input variable, sets up the autograd graph, and stores both the input and output for gradient computation.
backward(const Tensor &grad_output)
```
vector<Tensor> ReLUFunction::backward(const Tensor &grad_output);
```
Computes the gradient with respect to the input using the condition: grad_in = grad_output * ((input > 0) ? 1.0f : 0.0f).
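The masking behaviour is easy to see on plain vectors. The helpers below are hypothetical. Note that the backward pass needs the original input (or equivalently the output) to know which elements were positive.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Forward: y_i = max(0, x_i)
std::vector<float> relu_forward_sketch(const std::vector<float> &x) {
    std::vector<float> y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        y[i] = (x[i] > 0.0f) ? x[i] : 0.0f;
    }
    return y;
}

// Backward: the gradient passes through where the input was positive
// and is zero everywhere else (including at x == 0).
std::vector<float> relu_backward_sketch(const std::vector<float> &grad_output,
                                        const std::vector<float> &x) {
    assert(grad_output.size() == x.size());
    std::vector<float> grad_input(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        grad_input[i] = grad_output[i] * ((x[i] > 0.0f) ? 1.0f : 0.0f);
    }
    return grad_input;
}
```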

ReLU Module

The kernelnet::nn::ReLU class is a user-facing module that inherits from kernelnet::nn::SingleInputModule and serves as a wrapper around ReLUFunction.

In its forward pass, the module calls ReLUFunction::apply on the input variable to compute the ReLU activation, with support for both CPU and CUDA computations.


ReLU::ReLU() {} // Constructor

VarPtr ReLU::forward(const VarPtr &input) {
    return ReLUFunction::apply(input);
}

kernelnet::nn::Dense

The kernelnet::nn::Dense layer is a fully connected (linear) neural network layer. It performs a linear transformation on input data by computing:

output = input × weight^T + bias

Here, the weight matrix has dimensions (output_dim × input_dim) and the bias vector has dimensions (output_dim). The Dense layer automatically manages the gradient computations during backpropagation.

Constructor

The kernelnet::nn::Dense layer inherits from kernelnet::nn::SingleInputModule and is constructed by providing:

input_dim – the number of input features.
output_dim – the number of neurons (output features).
device – the device for tensor allocation (CPU or CUDA).

Dense::Dense(int input_dim, int output_dim, Device device)
      : input_dim(input_dim), output_dim(output_dim) {}

Methods

forward(const VarPtr &input)
```
VarPtr Dense::forward(const VarPtr &input) override;
```
Executes the forward pass by performing a matrix multiplication between the input and the transposed weight matrix, then adds the bias (replicated across the batch).
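The forward computation can be written as a plain loop. This sketch assumes row-major storage for the [batch_size, input_dim] input and the (output_dim × input_dim) weight matrix. `dense_forward_sketch` is a hypothetical helper, not the layer's actual code path.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// output[b][o] = sum_i input[b][i] * weight[o][i] + bias[o]
std::vector<float> dense_forward_sketch(const std::vector<float> &input,
                                        const std::vector<float> &weight,
                                        const std::vector<float> &bias,
                                        int batch_size, int input_dim, int output_dim) {
    assert(input.size() == static_cast<std::size_t>(batch_size) * input_dim);
    assert(weight.size() == static_cast<std::size_t>(output_dim) * input_dim);
    assert(bias.size() == static_cast<std::size_t>(output_dim));
    std::vector<float> output(static_cast<std::size_t>(batch_size) * output_dim);
    for (int b = 0; b < batch_size; ++b) {
        for (int o = 0; o < output_dim; ++o) {
            float acc = bias[o];  // the same bias is reused for every sample
            for (int i = 0; i < input_dim; ++i) {
                // Indexing weight as [o][i] is what "input x weight^T" means.
                acc += input[b * input_dim + i] * weight[o * input_dim + i];
            }
            output[b * output_dim + o] = acc;
        }
    }
    return output;
}
```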
parameters()
```
vector<VarPtr> Dense::parameters();
```
Returns a vector containing the learnable parameters (weight and bias).

Example


  #include "kernelnet.hpp"

  // Create a Dense layer with 128 input features and 64 output features on CUDA.
  Dense denseLayer(128, 64, CUDA);
  
  // Create an input tensor of batch_size * input_dim elements.
  const int batch_size = 4;  // example value
  Tensor inputTensor(128 * batch_size, CPU);
  inputTensor.fill(1.0f);
  inputTensor.toCUDA();
  
  // Wrap the input tensor in a Variable.
  auto inputVar = make_shared<Variable>(inputTensor, true);
  
  VarPtr outputVar = denseLayer.forward(inputVar);
  
  outputVar->data.print();

kernelnet::nn::Embedding

The kernelnet::nn::Embedding layer converts token indices into dense embedding vectors using a learnable weight matrix. Internally, the lookup is performed by Embedding::EmbeddingLookupFunction, an autograd-compatible function derived from kernelnet::autograd::Function. It caches the token indices during the forward pass and accumulates gradients into the corresponding rows of the weight during the backward pass.

Constructor

The kernelnet::nn::Embedding layer inherits from kernelnet::nn::SingleInputModule and is constructed by specifying:

vocab_size: The number of tokens in the vocabulary.
embed_dim: The dimensionality of the embedding vectors.
dev (optional): The device on which to allocate the weight tensor (CPU or CUDA). Defaults to CPU.

Embedding::Embedding(int vocab_size, int embed_dim, Device dev = CPU)
      : vocab_size(vocab_size), embed_dim(embed_dim) {}

Methods

forward(const VarPtr &input)
```
VarPtr Embedding::forward(const VarPtr &input) override;
```
Performs an embedding lookup using the input variable (which contains token indices) and returns the corresponding embedding vectors. Internally, it calls EmbeddingLookupFunction::apply.
parameters()
```
vector<VarPtr> Embedding::parameters() override;
```
Returns a vector containing the learnable embedding weight.
EmbeddingLookupFunction::apply()
```
static VarPtr EmbeddingLookupFunction::apply(const VarPtr &indices, const VarPtr &weight, int embed_dim);
```
Internally extracts token indices, performs a row lookup on the weight, and caches the indices for the backward pass.
EmbeddingLookupFunction::backward()
```
vector<Tensor> EmbeddingLookupFunction::backward(const Tensor &grad_output);
```
Computes gradients for the embedding weight by accumulating gradient slices for each token index.
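The lookup and its gradient accumulation can be sketched on flat arrays. This assumes the weight is stored row-major as [vocab_size, embed_dim] and that token indices arrive as floats, as in a float-valued Tensor. Both function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Forward: copy one weight row per token index.
std::vector<float> embedding_forward_sketch(const std::vector<float> &indices,
                                            const std::vector<float> &weight,
                                            int embed_dim) {
    std::vector<float> out(indices.size() * embed_dim);
    for (std::size_t t = 0; t < indices.size(); ++t) {
        const std::size_t row = static_cast<std::size_t>(indices[t]);
        assert((row + 1) * embed_dim <= weight.size());
        for (int d = 0; d < embed_dim; ++d) {
            out[t * embed_dim + d] = weight[row * embed_dim + d];
        }
    }
    return out;
}

// Backward: add each token's gradient slice to the row it was read from.
// A token that appears several times contributes several times.
std::vector<float> embedding_backward_sketch(const std::vector<float> &indices,
                                             const std::vector<float> &grad_output,
                                             int vocab_size, int embed_dim) {
    assert(grad_output.size() == indices.size() * embed_dim);
    std::vector<float> grad_weight(static_cast<std::size_t>(vocab_size) * embed_dim, 0.0f);
    for (std::size_t t = 0; t < indices.size(); ++t) {
        const std::size_t row = static_cast<std::size_t>(indices[t]);
        assert(row < static_cast<std::size_t>(vocab_size));
        for (int d = 0; d < embed_dim; ++d) {
            grad_weight[row * embed_dim + d] += grad_output[t * embed_dim + d];
        }
    }
    return grad_weight;
}
```

Rows for tokens that never appear in the batch keep a zero gradient, which is why accumulation (rather than assignment) is needed for repeated tokens.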

Example


  #include "kernelnet.hpp"

  // Create an Embedding layer with a vocabulary of 1000 tokens and embeddings of size 128 on CUDA.
  Embedding embeddingLayer(1000, 128, CUDA);
  
  Tensor tokenIndices(10, CPU);
  for (size_t i = 0; i < tokenIndices.size(); i++) {
      tokenIndices.data()[i] = static_cast<float>(i % 1000);
  }
  tokenIndices.toCUDA();
  
  // Wrap the token indices in a Variable.
  auto indicesVar = make_shared<Variable>(tokenIndices, false);
  
  VarPtr embeddings = embeddingLayer.forward(indicesVar);
  
  embeddings->data.print();

kernelnet::nn::Conv2D

The kernelnet::nn::Conv2D layer is a learnable 2D convolutional module designed for processing image data. It slides a set of convolutional kernels (filters) over an input tensor (e.g., an image or feature map) and adds a bias, computing the operation:

output = conv2d(input, weight) + bias

Constructor

The kernelnet::nn::Conv2D layer, inherited from kernelnet::nn::SingleInputModule, is constructed by specifying:

in_channels: The number of channels in the input.
out_channels: The number of channels produced by the convolution.
kernel_h and kernel_w: Height and width of the convolutional kernel.
input_height and input_width: Dimensions of the input tensor.
stride: The convolution stride.
padding: The zero-padding size.
device: The target device (CPU or CUDA).

Conv2D::Conv2D(int in_channels, int out_channels, int kernel_h, int kernel_w,
             int input_height, int input_width, int stride, int padding, Device device)
      : in_channels(in_channels), out_channels(out_channels),
        kernel_h(kernel_h), kernel_w(kernel_w),
        stride(stride), padding(padding),
        input_height(input_height), input_width(input_width) {}

Methods

forward(const VarPtr &input)
```
VarPtr Conv2D::forward(const VarPtr &input) override;
```
Performs the forward pass by calling Conv2DFunction::apply, which computes the convolution using either CPU loops or CUDA kernels.
parameters()
```
vector<VarPtr> Conv2D::parameters() override;
```
Returns a vector containing the layer's learnable parameters: the weight and bias variables.
Conv2DFunction::apply()
```
static VarPtr Conv2DFunction::apply(const VarPtr &input,
      const VarPtr &weight, const VarPtr &bias,
      int in_channels, int input_height, int input_width,
      int out_channels, int kernel_h, int kernel_w, int stride, int padding);
```
This static method links the forward pass with the autograd system: it stores the input, weight, and bias, computes the output, and creates a Variable whose creator is set to the function for later gradient propagation.
Conv2DFunction::backward()
```
vector<Tensor> Conv2DFunction::backward(const Tensor &grad_output);
```
kernelnet::nn::Conv2DFunction inherits from kernelnet::autograd::Function. Its backward method computes gradients with respect to the input, weight, and bias, using specialized CUDA kernels on GPU or loops on CPU.
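
For intuition, the CPU path can be pictured as a nest of loops over output positions and kernel taps. The sketch below is an illustration under assumptions, not the library's code: the names convOutputSize and conv2dNaive are hypothetical, it handles one sample, and it assumes row-major layouts of [in_channels, H, W] for the input and [out_channels, in_channels, kernel_h, kernel_w] for the weight.

```cpp
#include <vector>
using namespace std;

// Output size along one spatial axis for a given kernel, stride and zero padding.
int convOutputSize(int input_size, int kernel, int stride, int padding) {
    return (input_size + 2 * padding - kernel) / stride + 1;
}

// Naive convolution for ONE sample (layouts are assumptions of this sketch).
vector<float> conv2dNaive(const vector<float> &input, const vector<float> &weight,
                          const vector<float> &bias, int in_channels, int H, int W,
                          int out_channels, int kh, int kw, int stride, int padding) {
    int out_h = convOutputSize(H, kh, stride, padding);
    int out_w = convOutputSize(W, kw, stride, padding);
    vector<float> out(out_channels * out_h * out_w);
    for (int oc = 0; oc < out_channels; ++oc)
        for (int oy = 0; oy < out_h; ++oy)
            for (int ox = 0; ox < out_w; ++ox) {
                float acc = bias[oc];
                for (int ic = 0; ic < in_channels; ++ic)
                    for (int ky = 0; ky < kh; ++ky)
                        for (int kx = 0; kx < kw; ++kx) {
                            int iy = oy * stride + ky - padding;
                            int ix = ox * stride + kx - padding;
                            // Positions outside the image contribute zero (zero padding).
                            if (iy < 0 || iy >= H || ix < 0 || ix >= W) continue;
                            acc += input[(ic * H + iy) * W + ix] *
                                   weight[((oc * in_channels + ic) * kh + ky) * kw + kx];
                        }
                out[(oc * out_h + oy) * out_w + ox] = acc;
            }
    return out;
}
```

With a 32x32 input, a 3x3 kernel and stride 1, convOutputSize gives 30 without padding and 32 with padding 1.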

Example


  #include "kernelnet.hpp"

  // Create a Conv2D layer with 3 input channels, 16 output channels,
  // a 3x3 kernel, input dimensions 32x32, stride 1, and no padding on CUDA.
  Conv2D convLayer(3, 16, 3, 3, 32, 32, 1, 0, CUDA);
  
  // Create an input tensor for a batch of 10 samples.
  Tensor inputTensor(10 * 3 * 32 * 32, CPU);
  inputTensor.fill(1.0f);
  inputTensor.toCUDA();
  
  // Wrap the input tensor into a Variable.
  auto inputVar = make_shared<Variable>(inputTensor, true);
  
  VarPtr outputVar = convLayer.forward(inputVar);

  outputVar->toCPU();
  outputVar->print();

kernelnet::nn::MaxPool2D

The kernelnet::nn::MaxPool2D layer performs spatial downsampling by applying a max pooling operation. It takes an input tensor with shape [batch_size, channels, input_height, input_width] and outputs a pooled tensor with reduced spatial dimensions. During the forward pass, the maximum value within each pooling window is selected, and its index is stored for use in the backward pass.

Constructor

The kernelnet::nn::MaxPool2D module, inherited from kernelnet::nn::SingleInputModule, is initialized with the following parameters:

kernel_size: The size of the square pooling window.
stride: The stride for the pooling operation.
batch_size: The number of samples in the batch.
channels: The number of channels in the input tensor.
input_height and input_width: The spatial dimensions of the input.

MaxPool2D::MaxPool2D(int kernel_size, int stride,
      int batch_size, int channels,
      int input_height, int input_width)
      : kernel_size(kernel_size), stride(stride),
        batch_size(batch_size), channels(channels),
        input_height(input_height), input_width(input_width) {}

Methods

forward(const VarPtr &input)
```
VarPtr MaxPool2D::forward(const VarPtr &input) override;
```
Applies max pooling on the input variable. Internally, it calls MaxPool2DFunction::apply to execute the pooling operation and store the max indices.
MaxPool2DFunction::apply()
```
static VarPtr MaxPool2DFunction::apply(const VarPtr &input,
      int batch_size, int channels,
      int input_height, int input_width,
      int kernel_size, int stride);
```
kernelnet::nn::MaxPool2DFunction inherits from kernelnet::autograd::Function. Its apply method performs the forward pass for max pooling and caches the max indices required for the backward pass.
MaxPool2DFunction::backward()
```
vector<Tensor> MaxPool2DFunction::backward(const Tensor &grad_output);
```
Uses stored max indices to propagate gradients from the output back to the corresponding input elements.
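
The role of the cached indices can be illustrated on a single channel plane. This is a hedged sketch on plain vectors: PoolResult, maxPoolForward and maxPoolBackward are hypothetical names, and the library's actual kernels may organize the work differently.

```cpp
#include <cstddef>
#include <vector>
using namespace std;

struct PoolResult {
    vector<float> output;   // pooled values
    vector<int> max_index;  // flat input index of each pooled value
};

// Max pooling over ONE channel plane of shape [H, W] (row-major).
PoolResult maxPoolForward(const vector<float> &input, int H, int W,
                          int kernel_size, int stride) {
    int out_h = (H - kernel_size) / stride + 1;
    int out_w = (W - kernel_size) / stride + 1;
    PoolResult r;
    for (int oy = 0; oy < out_h; ++oy)
        for (int ox = 0; ox < out_w; ++ox) {
            int best = (oy * stride) * W + ox * stride;  // top-left of the window
            for (int ky = 0; ky < kernel_size; ++ky)
                for (int kx = 0; kx < kernel_size; ++kx) {
                    int idx = (oy * stride + ky) * W + (ox * stride + kx);
                    if (input[idx] > input[best]) best = idx;
                }
            r.output.push_back(input[best]);
            r.max_index.push_back(best);
        }
    return r;
}

// Backward: each output gradient flows only to the input element that won the max.
vector<float> maxPoolBackward(const vector<float> &grad_output,
                              const vector<int> &max_index, int input_size) {
    vector<float> grad_input(input_size, 0.0f);
    for (size_t i = 0; i < grad_output.size(); ++i)
        grad_input[max_index[i]] += grad_output[i];
    return grad_input;
}
```

All input positions that did not win a window receive a zero gradient.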

Example


  #include "kernelnet.hpp"

  // Create a MaxPool2D layer with a 2x2 pooling window and stride of 2, for a batch of 10 samples and 3 channels, with input size 32x32.
  MaxPool2D maxPoolLayer(2, 2, 10, 3, 32, 32);
  
  // Create an input tensor of shape (10 * 3 * 32 * 32).
  Tensor inputTensor(10 * 3 * 32 * 32, CPU);
  inputTensor.fill(1.0f);
  
  auto inputVar = make_shared<Variable>(inputTensor, true);
  
  VarPtr outputVar = maxPoolLayer.forward(inputVar);

  outputVar->print();

kernelnet::nn::LSTMCell

The kernelnet::nn::LSTMCell module implements a single time‐step of an LSTM. It encapsulates the learnable parameters (input‐to‐hidden and hidden‐to‐hidden weights and biases) and internally invokes LSTMCellFunction::apply to build the autograd graph. On forward, it computes the four gates (input, forget, cell, output), updates the cell and hidden states, and caches all intermediates required for backward.

Constructor

The kernelnet::nn::LSTMCell module, inherited from kernelnet::nn::Module, is initialized with the following parameters:

input_dim: The size of input vector at each time step.
hidden_dim: The size of hidden state.
device: CPU or CUDA.


  LSTMCell::LSTMCell(int input_dim, int hidden_dim, Device device = CPU)
      : input_dim(input_dim), hidden_dim(hidden_dim), device(device) {}

Methods

forward(const vector<VarPtr>& inputs)
```
vector<VarPtr> LSTMCell::forward(const vector<VarPtr> &inputs) override;
```
Expects {input, h_prev, c_prev}; returns {h_new, c_new}.

LSTMCellFunction::apply()

static pair<VarPtr,VarPtr> LSTMCellFunction::apply(
      const VarPtr &input, const VarPtr &h_prev, const VarPtr &c_prev,
      const VarPtr &weight_ih, const VarPtr &weight_hh,
      const VarPtr &bias_ih, const VarPtr &bias_hh,
      int input_dim, int hidden_dim);

Performs all gate computations and returns new states, registering a Function for backward.

LSTMCellFunction::backward()
```
vector<Tensor> LSTMCellFunction::backward(const Tensor &grad_output) override;
```
Given gradients for h_new and optionally c_new, computes and returns gradients for {input, h_prev, c_prev, weight_ih, weight_hh, bias_ih, bias_hh}.
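
The gate arithmetic of one time step follows the standard LSTM equations and can be sketched for a single sample as below. The names LSTMState, sigmoidGate and lstmCellStep are hypothetical, and the weight layout ([4*hidden_dim, input_dim] and [4*hidden_dim, hidden_dim], gates stacked in the order input, forget, cell, output) is an assumption of this sketch rather than a documented property of the library.

```cpp
#include <cmath>
#include <vector>
using namespace std;

struct LSTMState { vector<float> h, c; };

static float sigmoidGate(float x) { return 1.0f / (1.0f + exp(-x)); }

// One LSTM time step for a single sample.
// Gate order (input, forget, cell, output) is an assumption of this sketch.
LSTMState lstmCellStep(const vector<float> &x, const LSTMState &prev,
                       const vector<float> &w_ih, const vector<float> &w_hh,
                       const vector<float> &b_ih, const vector<float> &b_hh,
                       int input_dim, int hidden_dim) {
    // Pre-activations: w_ih * x + b_ih + w_hh * h_prev + b_hh.
    vector<float> gates(4 * hidden_dim);
    for (int r = 0; r < 4 * hidden_dim; ++r) {
        float acc = b_ih[r] + b_hh[r];
        for (int k = 0; k < input_dim; ++k) acc += w_ih[r * input_dim + k] * x[k];
        for (int k = 0; k < hidden_dim; ++k) acc += w_hh[r * hidden_dim + k] * prev.h[k];
        gates[r] = acc;
    }
    LSTMState next{vector<float>(hidden_dim), vector<float>(hidden_dim)};
    for (int j = 0; j < hidden_dim; ++j) {
        float i = sigmoidGate(gates[j]);
        float f = sigmoidGate(gates[hidden_dim + j]);
        float g = tanh(gates[2 * hidden_dim + j]);
        float o = sigmoidGate(gates[3 * hidden_dim + j]);
        next.c[j] = f * prev.c[j] + i * g;   // c_new = f * c_prev + i * g
        next.h[j] = o * tanh(next.c[j]);     // h_new = o * tanh(c_new)
    }
    return next;
}
```

The intermediate values i, f, g, o and tanh(c_new) are the quantities a backward pass would need to have cached.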

Example


  #include "kernelnet.hpp"

  // Create cell for input_dim=16, hidden_dim=32 on CPU
  LSTMCell lstm(16, 32, CPU);
  
  // Prepare dummy inputs
  Tensor x(10 * 16, CPU); x.fill(0.1f);
  Tensor h(10 * 32, CPU); h.fill(0.0f);
  Tensor c(10 * 32, CPU); c.fill(0.0f);
  auto x_var = make_shared<Variable>(x, true);
  auto h_var = make_shared<Variable>(h, true);
  auto c_var = make_shared<Variable>(c, true);
  
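  // forward returns {h_new, c_new} as a vector, so unpack it by index.
  vector<VarPtr> states = lstm.forward({x_var, h_var, c_var});
  VarPtr h_new = states[0];
  VarPtr c_new = states[1];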

kernelnet::nn::LSTM

The kernelnet::nn::LSTM module wraps a single kernelnet::nn::LSTMCell to process an entire sequence of length sequence_length. Given a flattened input tensor of shape [batch_size * sequence_length * input_dim], it slices out each time‐step, feeds it (along with the running hidden and cell state) into the internal kernelnet::nn::LSTMCell, and finally concatenates all hidden‐state outputs into one long tensor.

Constructor

The kernelnet::nn::LSTM module, inherited from kernelnet::nn::SingleInputModule, is initialized with the following parameters:

batch_size: The number of sequences per batch.
sequence_length: The number of time steps per sequence.
input_dim: The size of the input vector at each time step.
hidden_dim: The size of the hidden state.
device: CPU or CUDA.


LSTM::LSTM(int batch_size, int sequence_length, int input_dim, int hidden_dim, Device device)
  : batch_size(batch_size), sequence_length(sequence_length), input_dim(input_dim), hidden_dim(hidden_dim), device(device), cell(input_dim, hidden_dim, device) {}

Internally constructs a single kernelnet::nn::LSTMCell with the same input_dim and hidden_dim.

Methods

forward(const VarPtr &input)
```
virtual VarPtr LSTM::forward(const VarPtr &input) override;
```
- Expects input shaped [batch_size*sequence_length*input_dim].
- Initializes h0, c0 to zero.
- Unrolls over each slice of length input_dim.
- Returns a single VarPtr containing all hidden states concatenated: shape [batch_size*sequence_length, hidden_dim].
parameters()
```
virtual vector<VarPtr> LSTM::parameters() override;
```
Delegates to the internal LSTMCell’s parameters (weight_ih, weight_hh, bias_ih, bias_hh).
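
The unrolling can be pictured as a loop that slices one time step out of the flattened input, advances the state, and appends the new hidden state. The sketch below is illustrative only: sliceTimeStep, StepFn and unrollSequence are hypothetical names, the [batch_size, sequence_length, input_dim] memory layout and the time-major concatenation order are assumptions, and the cell state is omitted to keep the example short.

```cpp
#include <functional>
#include <vector>
using namespace std;

// Gathers time step t of every sequence from a flattened
// [batch_size, sequence_length, input_dim] buffer into [batch_size, input_dim].
vector<float> sliceTimeStep(const vector<float> &flat, int batch_size,
                            int sequence_length, int input_dim, int t) {
    vector<float> step(batch_size * input_dim);
    for (int b = 0; b < batch_size; ++b)
        for (int k = 0; k < input_dim; ++k)
            step[b * input_dim + k] = flat[(b * sequence_length + t) * input_dim + k];
    return step;
}

// Stand-in for the recurrent cell: maps (x_t, h_prev) to h_new.
using StepFn = function<vector<float>(const vector<float> &, const vector<float> &)>;

// Runs the step function over all time steps and concatenates the hidden states.
vector<float> unrollSequence(const vector<float> &flat, int batch_size,
                             int sequence_length, int input_dim, int hidden_dim,
                             const StepFn &step) {
    vector<float> h(batch_size * hidden_dim, 0.0f);  // h0 initialized to zero
    vector<float> all_hidden;
    for (int t = 0; t < sequence_length; ++t) {
        h = step(sliceTimeStep(flat, batch_size, sequence_length, input_dim, t), h);
        all_hidden.insert(all_hidden.end(), h.begin(), h.end());
    }
    return all_hidden;
}
```

Because the same step function is reused at every time step, the parameters are shared across the whole sequence, which matches the single internal LSTMCell described above.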

Example


#include "kernelnet.hpp"

// Unrolling an LSTM over a 20‐step sequence of 8‐dim inputs and 16-dim outputs with batch size 4
LSTM lstm(4, 20, 8, 16, CPU);

Tensor seq_in(4 * 20 * 8, CPU);
seq_in.fill(0.5f);
auto seq_var = make_shared<Variable>(seq_in, true);

auto out = lstm.forward(seq_var);

kernelnet::optim::SGD

kernelnet::optim::SGD implements stochastic gradient descent with optional gradient clipping. On each step(), if clip_value is greater than 0, it rescales any parameter gradient whose ℓ₂‑norm exceeds clip_value, then updates each parameter:

param = param - lr * grad

Constructor

- params: List of trainable variables
- lr: Learning rate
- clip_value: Maximum allowed gradient norm (0 = no clipping)

SGD(const vector<VarPtr>& params, float lr, float clip_value = 0.0f);

Methods

step()
```
void SGD::step();
```
Optionally clips each parameter’s gradient, then updates param[i] -= lr * grad[i].
zero_grad()
```
void SGD::zero_grad();
```
Sets all gradients to zero and resets their initialization flags.
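
A minimal sketch of the clip-then-update rule on plain vectors is shown below. The function names clipGradNorm and sgdStep are hypothetical, and the sketch assumes the clipping is applied to each parameter's gradient independently, as the description of step() suggests.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>
using namespace std;

// Rescales grad in place so that its l2 norm does not exceed clip_value.
// clip_value <= 0 disables clipping.
void clipGradNorm(vector<float> &grad, float clip_value) {
    if (clip_value <= 0.0f) return;
    float sq = 0.0f;
    for (float g : grad) sq += g * g;
    float norm = sqrt(sq);
    if (norm > clip_value) {
        float scale = clip_value / norm;
        for (float &g : grad) g *= scale;
    }
}

// One SGD update for a single parameter: param = param - lr * grad.
void sgdStep(vector<float> &param, vector<float> &grad, float lr, float clip_value) {
    clipGradNorm(grad, clip_value);
    for (size_t i = 0; i < param.size(); ++i) param[i] -= lr * grad[i];
}
```

For example, a gradient (3, 4) has norm 5; with clip_value = 1 it is rescaled to (0.6, 0.8) before the update.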

kernelnet::trainer::Trainer

The kernelnet::trainer::Trainer class wraps a kernelnet::nn::Sequential model, a kernelnet::optim::SGD optimizer, and a pluggable loss function inherited from kernelnet::autograd::Function. Its trainEpoch method runs, for each sample:

Forward through the model
Compute loss via the provided LossFunction
Backward pass to populate gradients
Optimizer step()
Optimizer zero_grad()

Constructor

model: The customized kernelnet::nn::Sequential model
optimizer: A kernelnet::optim::SGD instance
loss_fn: Function inherited from kernelnet::autograd::Function taking (prediction, target_tensor) → scalar loss

Trainer(const shared_ptr<Sequential>& model, const SGD& optimizer, LossFunction loss_fn = MSEFunction::apply);

Methods

trainEpoch(const vector<VarPtr>& inputs, const vector<VarPtr>& targets)
```
trainEpoch(const vector<VarPtr>& inputs, const vector<VarPtr>& targets);
```
Runs one epoch over the given input–target pairs, performing forward, loss, backward, update, and zero‑grad for each sample.

kernelnet::data

The kernelnet::data module provides utilities for loading and batching datasets. Currently supported:

CIFAR‑10 (image classification)
Penn Treebank (PTB) (language modeling)

kernelnet::data::CIFAR10Dataset

Loads CIFAR‑10 binary files, normalizes images, and one‑hot encodes labels.


class CIFAR10Dataset {
public:
  CIFAR10Dataset(const string &data_dir, bool train);
  size_t size() const;
  const CIFAR10Sample& getSample(size_t index) const;
};
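
In the CIFAR-10 binary format, each record is 3073 bytes: one label byte followed by 3072 pixel bytes (the red, green, and blue planes of a 32x32 image). A hedged sketch of decoding one record is shown below; DecodedSample and decodeRecord are hypothetical names, and scaling pixels to [0, 1] by dividing by 255 is an assumption about what "normalizes" means here.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>
using namespace std;

constexpr int kPixels = 3 * 32 * 32;       // 3072 pixel bytes per record
constexpr int kRecordBytes = 1 + kPixels;  // 1 label byte + pixels
constexpr int kNumClasses = 10;

struct DecodedSample {
    vector<float> image;  // [3, 32, 32], scaled to [0, 1] (assumed normalization)
    vector<float> label;  // one-hot vector of length 10
};

// Decodes the record that starts at byte `offset` of a CIFAR-10 binary buffer.
DecodedSample decodeRecord(const vector<uint8_t> &buffer, size_t offset) {
    DecodedSample s;
    s.label.assign(kNumClasses, 0.0f);
    s.label[buffer[offset]] = 1.0f;  // first byte is the class id (0..9)
    s.image.resize(kPixels);
    for (int i = 0; i < kPixels; ++i)
        s.image[i] = buffer[offset + 1 + i] / 255.0f;
    return s;
}
```

Record n of a file starts at offset n * kRecordBytes, so a whole file can be decoded by stepping through the buffer in strides of 3073 bytes.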

kernelnet::data::CIFAR10DataLoader

Wraps a kernelnet::data::CIFAR10Dataset for shuffled mini‑batch iteration.


class CIFAR10DataLoader {
public:
  CIFAR10DataLoader(CIFAR10Dataset &dataset, int batch_size, bool shuffle=true);
  void reset();
  bool hasNext() const;
  pair<Tensor,Tensor> nextBatch();
};

kernelnet::data::PTBDataset

Loads a Penn Treebank text file, builds a word‐to‐index vocabulary, and slices the token stream into fixed‐length (input, target) sequence pairs.


struct PTBSample {
  Tensor input;   // length = sequence_length
  Tensor target;  // length = sequence_length
};

class PTBDataset {
public:
  PTBDataset(const string &file, int sequence_length);
  size_t size() const;
  const PTBSample& getSample(size_t index) const;
private:
  void loadFile(const string &filename);
  void buildVocabulary(const vector<string> &tokens);
};
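
The two preprocessing steps, building the vocabulary and slicing the token stream, can be sketched as follows. buildVocab and makeSequences are hypothetical helpers; the sketch assumes indices are assigned in order of first appearance, that windows do not overlap, and that each target is the input shifted by one token (the usual language-modeling setup).

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>
using namespace std;

// Assigns each distinct word the next free index, in order of first appearance.
unordered_map<string, int> buildVocab(const vector<string> &tokens) {
    unordered_map<string, int> vocab;
    for (const string &t : tokens)
        vocab.emplace(t, static_cast<int>(vocab.size()));  // no-op if t is already present
    return vocab;
}

// Cuts the id stream into non-overlapping windows of sequence_length tokens.
// The target of each window is the same window shifted one token to the right.
vector<pair<vector<int>, vector<int>>> makeSequences(const vector<int> &ids,
                                                     int sequence_length) {
    vector<pair<vector<int>, vector<int>>> samples;
    for (size_t start = 0; start + sequence_length + 1 <= ids.size();
         start += sequence_length) {
        vector<int> input(ids.begin() + start, ids.begin() + start + sequence_length);
        vector<int> target(ids.begin() + start + 1,
                           ids.begin() + start + sequence_length + 1);
        samples.emplace_back(move(input), move(target));
    }
    return samples;
}
```

A trailing window that has no complete target is dropped in this sketch.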

kernelnet::data::PTBDataLoader

Wraps a kernelnet::data::PTBDataset for shuffled mini‑batch retrieval of sequence samples.


class PTBDataLoader {
public:
  PTBDataLoader(PTBDataset &dataset, int batch_size, bool shuffle=true);
  void reset();
  bool hasNext() const;
  pair<Tensor,Tensor> nextBatch();
private:
  PTBDataset &dataset;
  int batch_size;
  bool shuffle;
  vector<size_t> indices;
  size_t current_index;
};
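
Both loaders follow the same reset / hasNext / nextBatch pattern over a list of sample indices. The sketch below shows that pattern in isolation, returning index batches instead of tensors. IndexBatcher is a hypothetical class, and keeping the final partial batch (rather than dropping it) is an assumption.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>
using namespace std;

class IndexBatcher {
public:
    IndexBatcher(size_t dataset_size, size_t batch_size, bool shuffle, unsigned seed = 0)
        : indices_(dataset_size), batch_size_(batch_size), shuffle_(shuffle), rng_(seed) {
        iota(indices_.begin(), indices_.end(), size_t{0});  // 0, 1, 2, ...
        reset();
    }

    // Rewinds to the first batch and reshuffles the order if requested.
    void reset() {
        current_ = 0;
        if (shuffle_) std::shuffle(indices_.begin(), indices_.end(), rng_);
    }

    bool hasNext() const { return current_ < indices_.size(); }

    // Returns the next batch of sample indices; the last batch may be smaller.
    vector<size_t> nextBatch() {
        size_t end = min(current_ + batch_size_, indices_.size());
        vector<size_t> batch(indices_.begin() + current_, indices_.begin() + end);
        current_ = end;
        return batch;
    }

private:
    vector<size_t> indices_;
    size_t batch_size_;
    bool shuffle_;
    mt19937 rng_;
    size_t current_ = 0;
};
```

Calling reset() at the end of each epoch, as the training examples do, gives every epoch a fresh pass over the data.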

Example

Using the kernelnet::data utilities to benchmark training and evaluation on the CIFAR-10 dataset:


      #include "kernelnet.hpp"
      #include <chrono>

      using namespace std::chrono; // for high_resolution_clock, duration_cast, milliseconds

      void runCIFAR10Tests() {
        Device dev = CPU;
    
        int batch_size = 256;
        int num_epochs = 100;
        int num_classes = 10;
        int image_height = 32, image_width = 32, in_channels = 3;
    
        // --- Load CIFAR-10 Data ---
        CIFAR10Dataset trainDataset("data/cifar10/train", true);
        CIFAR10Dataset testDataset("data/cifar10/test", false);
        CIFAR10DataLoader trainLoader(trainDataset, batch_size, true);
        CIFAR10DataLoader testLoader(testDataset, batch_size, false);
    
        // --- Define Model Architecture ---
        // conv1: input channels 3 → 16, output dims remain 32×32
        auto conv1 = make_shared<Conv2D>(in_channels, 16, 3, 3, image_height, image_width, 1, 1, dev);
        auto pool1 = make_shared<MaxPool2D>(2, 2, batch_size, 16, image_height, image_width); // → 16×16
    
        auto conv2 = make_shared<Conv2D>(16, 32, 3, 3, image_height / 2, image_width / 2, 1, 1, dev);
        auto pool2 = make_shared<MaxPool2D>(2, 2, batch_size, 32, image_height / 2, image_width / 2); // → 8×8
    
        auto dense = make_shared<Dense>(32 * 8 * 8, num_classes, dev);
        auto softmax = make_shared<Softmax>(batch_size, num_classes);
    
        // Assemble model into a Sequential container
        shared_ptr<Sequential> model = make_shared<Sequential>(initializer_list<shared_ptr<SingleInputModule>>{
            conv1, pool1, conv2, pool2, dense, softmax});
    
        // --- Set Optimizer ---
        vector<VarPtr> params = model->parameters();
        float learning_rate = 0.01f;
        SGD optimizer(params, learning_rate);
    
        // --- Define Loss Function ---
        LossFunction loss_fn = [num_classes](const VarPtr &prediction, const Tensor &target) {
            return CrossEntropyLossFunction::apply(prediction, target, num_classes);
        };
    
        // --- Create Trainer ---
        Trainer trainer(model, optimizer, loss_fn);
    
        // --- Training Loop ---
        auto start = high_resolution_clock::now();
        for (int epoch = 0; epoch < num_epochs; epoch++) {
            float epoch_loss = 0.0f;
            int batches = 0;
    
            while (trainLoader.hasNext()) {
                auto batch = trainLoader.nextBatch();
                if (dev == CUDA) {
                    batch.first.toCUDA();
                    batch.second.toCUDA();
                }
    
                VarPtr input_var = make_shared<Variable>(batch.first, false, "input_batch");
                VarPtr target_var = make_shared<Variable>(batch.second, false, "target_batch");
    
                vector<VarPtr> inputs = {input_var};
                vector<VarPtr> targets = {target_var};
                trainer.trainEpoch(inputs, targets);
    
                VarPtr prediction = model->forward(input_var);
                VarPtr loss = loss_fn(prediction, batch.second);
                float batch_loss = loss->data.sum();
    
                epoch_loss += batch_loss;
                batches++;
            }
    
            trainLoader.reset();
            cout << "Epoch " << epoch << " Average Loss: " << epoch_loss / batches << endl;
        }
    
        auto end = high_resolution_clock::now();
        auto duration = duration_cast<milliseconds>(end - start).count();
        cout << "Custom architecture training completed in " << (duration / 1000.0) << " seconds" << endl;
    
        // --- Evaluate on Test Set ---
        int correct = 0, total = 0;
        while (testLoader.hasNext()) {
            auto batch = testLoader.nextBatch();
    
            if (dev == CUDA) {
                batch.first.toCUDA();
                batch.second.toCUDA();
            }
    
            VarPtr input_var = make_shared<Variable>(batch.first, false, "test_input");
            VarPtr prediction = model->forward(input_var);
    
            vector<int> pred_labels = prediction->data.argmax(1, num_classes);
            vector<int> true_labels = batch.second.argmax(1, num_classes);
    
            for (size_t i = 0; i < pred_labels.size(); ++i) {
                if (pred_labels[i] == true_labels[i])
                    correct++;
                total++;
            }
        }
    
        cout << "KernelNet Test Accuracy: " << (100.0 * correct / total) << "%" << endl;
    }
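
The accuracy computation at the end of the example compares the row-wise argmax of the predictions with that of the one-hot labels. A standalone sketch of that logic on plain vectors is given below; argmaxRows and accuracyPercent are hypothetical helpers and do not reproduce the Tensor::argmax signature.

```cpp
#include <cstddef>
#include <vector>
using namespace std;

// Index of the largest value in each row of a row-major [rows, num_classes] buffer.
vector<int> argmaxRows(const vector<float> &values, int num_classes) {
    vector<int> labels;
    for (size_t row = 0; row + num_classes <= values.size(); row += num_classes) {
        int best = 0;
        for (int c = 1; c < num_classes; ++c)
            if (values[row + c] > values[row + best]) best = c;
        labels.push_back(best);
    }
    return labels;
}

// Percentage of positions where the predicted label equals the true label.
double accuracyPercent(const vector<int> &pred, const vector<int> &truth) {
    int correct = 0;
    for (size_t i = 0; i < pred.size(); ++i)
        if (pred[i] == truth[i]) ++correct;
    return pred.empty() ? 0.0 : 100.0 * correct / pred.size();
}
```

Applying argmaxRows to a one-hot label row recovers the class id, which is why the same call works for both predictions and targets.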

Using the kernelnet::data utilities to benchmark training and evaluation on the PTB dataset:


#include "kernelnet.hpp"  
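#include <chrono>
#include <cmath>

using namespace std::chrono; // for high_resolution_clock, duration_cast, milliseconds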

// Helper function: converts a tensor of token indices to a one-hot encoded tensor
inline Tensor onehot(const Tensor &indices, int num_classes);
int runPTBTests() {
  // --- Hyperparameters and Device Setup ---
  Device dev = CPU;
  int batch_size = 8;
  int num_epochs = 20;
  int sequence_length = 35;
  int embed_dim = 128;
  int hidden_dim = 256;

  // --- Load PTB Data ---
  PTBDataset trainDataset("data/ptb/ptb.train.txt", sequence_length);
  PTBDataset testDataset("data/ptb/ptb.test.txt", sequence_length);
  PTBDataLoader trainLoader(trainDataset, batch_size, true);
  PTBDataLoader testLoader(testDataset, batch_size, false);

  int vocab_size = trainDataset.vocab_size;

  // --- Build Model ---
  // Model architecture: Embedding → LSTM (unrolled) → Dense → Softmax.
  auto embedding = make_shared<Embedding>(vocab_size, embed_dim, dev);
  auto lstm = make_shared<LSTM>(batch_size, sequence_length, embed_dim, hidden_dim, dev);
  auto dense = make_shared<Dense>(hidden_dim, vocab_size, dev);
  auto softmax = make_shared<Softmax>(batch_size * sequence_length, vocab_size);

  // Assemble the model using the kernelnet::nn::Sequential container.
  auto model = make_shared<Sequential>(initializer_list<shared_ptr<SingleInputModule>>{
      embedding,
      lstm,
      dense,
      softmax});

  // --- Setup Optimizer ---
  vector<VarPtr> params = model->parameters();
  float learning_rate = 0.01f;
  SGD optimizer(params, learning_rate);

  // --- Define Loss Function Lambda ---
  LossFunction loss_fn = [vocab_size](const VarPtr &prediction, const Tensor &target) {
      Tensor onehot_target = onehot(target, vocab_size);
      return CrossEntropyLossFunction::apply(prediction, onehot_target, vocab_size);
  };

  // --- Create Trainer ---
  // Trainer accepts model, optimizer, and the loss function.
  Trainer trainer(model, optimizer, loss_fn);

  // --- Training Loop ---
  auto start = high_resolution_clock::now();
  for (int epoch = 0; epoch < num_epochs; epoch++) {
      float epoch_loss = 0.0f;
      int batches = 0;
      while (trainLoader.hasNext()) {
          auto batch = trainLoader.nextBatch();

          if (dev == CUDA) {
              batch.first.toCUDA();
              batch.second.toCUDA();
          }

          // Wrap the input and the target into Variables.
          VarPtr input_var = make_shared<Variable>(batch.first, true, "input_batch");
          VarPtr target_var = make_shared<Variable>(batch.second, false, "target_batch");

          // Trainer.trainEpoch() takes a vector of input Variables and a vector of target Variables.
          vector<VarPtr> inputs = {input_var};
          vector<VarPtr> targets = {target_var};

          trainer.trainEpoch(inputs, targets);

          // For logging, compute loss separately:
          VarPtr prediction = model->forward(input_var);

          VarPtr loss = loss_fn(prediction, batch.second);

          float batch_loss = loss->data.sum();
          epoch_loss += batch_loss;
          batches++;
      }
      trainLoader.reset();
      cout << "Epoch " << epoch << " Average Loss: " << (epoch_loss / batches) << endl;
  }
  auto end = high_resolution_clock::now();
  auto duration = duration_cast<milliseconds>(end - start).count();
  cout << "LSTM training completed in " << (duration / 1000.0) << " seconds." << endl;

  // --- Evaluation: Compute Perplexity on the Test Set ---
  float total_loss = 0.0f;
  int total_tokens = 0;
  while (testLoader.hasNext()) {
      auto batch = testLoader.nextBatch();
      if (dev == CUDA) {
          batch.first.toCUDA();
          batch.second.toCUDA();
      }
      VarPtr input_var = make_shared<Variable>(batch.first, false, "valid_input");
      VarPtr prediction = model->forward(input_var);
      VarPtr loss = loss_fn(prediction, batch.second);
      total_loss += loss->data.sum();
      total_tokens += batch.second.size();
  }
  float avg_loss = total_loss / total_tokens;
  float perplexity = exp(avg_loss);
  cout << "Test Perplexity: " << perplexity << endl;
  return 0;
}
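
The example above declares the onehot helper without defining it and reports perplexity as the exponential of the average per-token loss. A minimal sketch of both ideas on plain vectors follows. onehotSketch, crossEntropySum and perplexityFromLoss are hypothetical names, and the sketch assumes the loss is the summed negative log-probability of the correct tokens; whether CrossEntropyLossFunction sums or averages is not stated here.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>
using namespace std;

// Converts token indices into a row-major [tokens, num_classes] one-hot buffer.
vector<float> onehotSketch(const vector<int> &indices, int num_classes) {
    vector<float> out(indices.size() * num_classes, 0.0f);
    for (size_t i = 0; i < indices.size(); ++i)
        out[i * num_classes + indices[i]] = 1.0f;
    return out;
}

// Summed cross-entropy: minus the log-probability of each correct class.
float crossEntropySum(const vector<float> &probs, const vector<float> &onehot_target) {
    float loss = 0.0f;
    for (size_t i = 0; i < probs.size(); ++i)
        if (onehot_target[i] > 0.0f) loss -= log(probs[i]);
    return loss;
}

// Perplexity is the exponential of the average per-token loss.
float perplexityFromLoss(float total_loss, int total_tokens) {
    return exp(total_loss / total_tokens);
}
```

As a sanity check, a model that assigns uniform probability over V classes has a per-token loss of log(V) and therefore a perplexity of V.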