Overview
KernelNet is a neural‑network framework built from the ground up in C++ and CUDA. It provides:
- Custom Tensor & Autograd: Multi‑dimensional arrays with element‑wise ops, matrix multiply/transpose, broadcasting, reductions, argmax, and built‑in automatic differentiation for gradient back‑propagation on both CPU and GPU.
- Neural Building Blocks: Layers for feed‑forward (Dense), convolutional (Conv2D), pooling (MaxPool2D), recurrent (LSTM), and embedding operations.
- Activation Functions: ReLU, Sigmoid, Tanh, and Softmax modules that integrate into customized models.
- Optimization & Loss: A simple SGD optimizer (with optional gradient clipping), plus MSE and Cross‑Entropy loss functions.
- Hardware Agnostic: Write once—run on CPU or accelerate with CUDA-enabled GPUs.
- Benchmarking Pipelines: Ready‑to‑use data loaders and example scripts for CIFAR‑10 and Penn Treebank to measure both performance and accuracy.
To obtain the KernelNet packages, see the KernelNet releases page, which contains the source code, the pre-built libraries kernelnet.dll and kernelnet.lib, and the /include folder.
Note: For brevity and clarity, all examples in this documentation assume using namespace std; so that vector, string, pair, etc. can be written without the std:: prefix.
kernelnet::tensor::Tensor
The kernelnet::tensor::Tensor
class represents a multi-dimensional array that supports operations on both CPU and CUDA (GPU) devices. It supports numerical arithmetic computations—including element-wise addition, subtraction, multiplication, matrix multiplication and transpose, as well as scalar multiplication, broadcast addition, summation, and argmax—providing a solid foundation for building neural networks.
Constructor
The kernelnet::tensor::Tensor
class provides two constructors:
-
Default Constructor
Tensor::Tensor() : _size(0), _data_host(nullptr), _data_device(nullptr), _device(CPU) {}
Creates a zero-sized tensor on the CPU as the default device with no allocated memory.
-
Parameterized Constructor
Tensor::Tensor(size_t size, Device device);
Constructs a tensor with the specified number of elements and target device. It allocates host memory immediately and, if the device is CUDA, also allocates device memory and copies the host data.
Methods
Utility Methods
-
fill(float val)
void Tensor::fill(float val);
Fills the tensor with a constant value. The host memory is updated and, if on CUDA, the data is copied to the device.
-
=(const Tensor &other)
Tensor &operator=(const Tensor &other);
Frees the existing memory and deep copies data from another tensor.
-
print() const
void Tensor::print() const;
Prints the tensor’s values to the console. For a CUDA tensor, its data is first copied to the host.
-
data()
float* Tensor::data(); const float* Tensor::data() const;
Returns a pointer to the underlying memory. Depending on the device, it will reference host or device memory.
-
size() const
size_t Tensor::size() const;
Returns the total number of elements in the tensor.
-
device() const
Device Tensor::device() const;
Retrieves the current device (either CPU or CUDA) where the tensor is stored.
-
toCUDA() / toCPU()
void Tensor::toCUDA(); void Tensor::toCPU();
toCUDA() transfers the tensor to GPU memory (allocating device memory as needed), while toCPU() transfers the data back to host memory and frees any allocated CUDA memory.
-
free()
void Tensor::free();
Deletes the tensor's host memory and frees device memory if allocated.
Arithmetic and Matrix Operations
-
add(const Tensor &a, const Tensor &b)
static Tensor Tensor::add(const Tensor &a, const Tensor &b);
Performs element-wise addition between two tensors. Assumes both tensors have the same size and reside on the same device.
-
subtract(const Tensor &a, const Tensor &b)
static Tensor Tensor::subtract(const Tensor &a, const Tensor &b);
Performs element-wise subtraction between two tensors. Assumes both tensors have the same size and reside on the same device.
-
multiply(const Tensor &a, const Tensor &b)
static Tensor Tensor::multiply(const Tensor &a, const Tensor &b);
Computes element-wise multiplication between two tensors. Assumes both tensors have the same size and reside on the same device.
-
broadcast_add(const Tensor &a, const Tensor &b)
static Tensor Tensor::broadcast_add(const Tensor &a, const Tensor &b);
If the tensors are of equal size, performs regular addition. If one tensor’s size divides the other, the smaller tensor is broadcast along that dimension.
-
matmul(const Tensor &a, const Tensor &b, int M, int K, int N)
static Tensor Tensor::matmul(const Tensor &a, const Tensor &b, int M, int K, int N);
Performs matrix multiplication between tensor a (of shape M×K) and tensor b (of shape K×N).
-
transpose(const Tensor &a, int rows, int cols)
static Tensor Tensor::transpose(const Tensor &a, int rows, int cols);
Returns the transposed tensor by treating it as a 2D matrix.
-
scalar_multiply(const Tensor &a, float scalar)
static Tensor Tensor::scalar_multiply(const Tensor &a, float scalar);
Multiplies every element in the tensor by a scalar value.
-
argmax() const
int Tensor::argmax() const;
Returns the index of the maximum element in the flattened tensor.
-
argmax(int axis, int dim_size) const
vector<int> Tensor::argmax(int axis, int dim_size) const;
Computes the argmax along axis 1 for a 2D tensor (stored as a flat 1D array with shape (batch_size, dim_size)), returning one index per row.
-
sum() const
float Tensor::sum() const;
Computes and returns the sum of all elements in the tensor.
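To make the size-based broadcasting rule and the batched argmax more concrete, here is a minimal standalone sketch on plain vector<float> buffers. It does not call KernelNet; broadcast_add_sketch and argmax_rows_sketch are hypothetical names, and the assumption that the smaller buffer is tiled repeatedly across the larger one (as a bias row added to every batch row would be) is an illustration rather than a statement about the library's internals.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>
using namespace std;

// Hypothetical helper (not part of KernelNet): adds two flat buffers, tiling
// the smaller one across the larger one when the sizes differ.
vector<float> broadcast_add_sketch(const vector<float> &a, const vector<float> &b) {
    const vector<float> &larger = a.size() >= b.size() ? a : b;
    const vector<float> &smaller = a.size() >= b.size() ? b : a;
    // Broadcasting is only defined here when one size divides the other.
    assert(!smaller.empty() && larger.size() % smaller.size() == 0);
    vector<float> out(larger.size());
    for (size_t i = 0; i < larger.size(); ++i) {
        out[i] = larger[i] + smaller[i % smaller.size()];
    }
    return out;
}

// Hypothetical helper: views data as (batch_size, dim_size) in row-major order
// and returns the column index of the largest value in each row.
vector<int> argmax_rows_sketch(const vector<float> &data, int dim_size) {
    assert(dim_size > 0 && data.size() % static_cast<size_t>(dim_size) == 0);
    const size_t width = static_cast<size_t>(dim_size);
    vector<int> result;
    for (size_t row = 0; row * width < data.size(); ++row) {
        size_t best = 0;
        for (size_t j = 1; j < width; ++j) {
            if (data[row * width + j] > data[row * width + best]) best = j;
        }
        result.push_back(static_cast<int>(best));
    }
    return result;
}
```

Under these assumptions, adding a 2-element buffer to a 4-element buffer reuses the two values for each consecutive pair, and a (2, 3) buffer yields two argmax indices.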
Example
The following code snippet demonstrates a basic example using CUDA. Two tensors are created, filled with constant values, transferred to CUDA, multiplied element-wise, and then the result is transferred back to CPU for display.
#include "kernelnet.hpp"
// Create two tensors on CPU, fill them, and then move them to CUDA.
Tensor a(10, CPU);
Tensor b(10, CPU);
a.fill(2.0f);
b.fill(3.0f);
a.toCUDA();
b.toCUDA();
// Compute element-wise multiplication on CUDA.
Tensor result = Tensor::multiply(a, b);
result.toCPU();
result.print();
kernelnet::autograd
The kernelnet::autograd
module forms the backbone of KernelNet’s automatic differentiation engine. It enables gradient-based optimization by linking together differentiable operations. The two central classes are kernelnet::autograd::Variable
and kernelnet::autograd::Function
.
While kernelnet::autograd::Variable
encapsulates tensor data and manages gradient information, kernelnet::autograd::Function
serves as an abstract base for all differentiable operations.
Autograd Classes
A dynamic computational graph is built during the forward pass. Every time a differentiable operation (a function derived from kernelnet::autograd::Function
) is executed via its static kernelnet::autograd::Function::apply
method, a new kernelnet::autograd::Variable
node is created. These nodes encapsulate both the computed tensor value and a pointer (via the creator
field) to the operation that produced them. Such intermediate nodes, created for operations like addition, multiplication, and slicing, are only needed during backpropagation to compute gradients. We use the type aliases
using VarPtr = shared_ptr<Variable>
and using FuncPtr = shared_ptr<Function>
so that these temporary nodes are automatically deallocated once they are no longer referenced—typically after the backward pass has propagated gradients through the graph.
-
Constructor
Variable::Variable(const Tensor &data, bool requires_grad = false, const string &name = "");
Creates a new variable that wraps a given tensor. If requires_grad is set to true, a gradient tensor (of matching size and device) is created and initialized to zero.
-
set_creator(const FuncPtr &func)
void Variable::set_creator(const FuncPtr &func);
Assigns the creator function (i.e., the operation that produced this variable).
-
backward(const Tensor &grad_output)
void Variable::backward(const Tensor &grad_output);
Initiates the backpropagation process. It accumulates gradients, and once all contributions are received, it propagates the gradient further by calling the backward method of its creator.
-
detach()
VarPtr Variable::detach();
Returns a detached copy of the variable that does not track gradients.
-
Abstract Base Class
Serves as the base for all differentiable operations.
-
backward(const Tensor &grad_output)
virtual vector<Tensor> Function::backward(const Tensor &grad_output) = 0;
A pure virtual method that must be implemented by derived classes. It receives the upstream gradient and computes gradients for each input.
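The interplay between a variable, its creator, and gradient accumulation can be sketched in a few lines using scalars instead of tensors. This is a simplified illustration, not KernelNet's code: NodeSketch, OpSketch, MulSketch, and the pending counter are hypothetical stand-ins for Variable, Function, a multiply operation, and the pending gradient count described in this section.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>
using namespace std;

struct NodeSketch;
using NodePtr = shared_ptr<NodeSketch>;

// Hypothetical stand-in for Function: remembers its inputs and maps an
// upstream gradient to one gradient per input.
struct OpSketch {
    vector<NodePtr> inputs;
    virtual ~OpSketch() = default;
    virtual vector<float> backward(float grad_output) = 0;
};

// Hypothetical stand-in for Variable, holding a scalar instead of a Tensor.
struct NodeSketch {
    float data = 0.0f;
    float grad = 0.0f;
    int pending = 0;               // gradient contributions still expected
    shared_ptr<OpSketch> creator;  // operation that produced this node

    void backward(float grad_output) {
        grad += grad_output;                       // accumulate the contribution
        if (pending > 0 && --pending > 0) return;  // wait for remaining consumers
        if (!creator) return;                      // leaf node: nothing to propagate
        vector<float> grads = creator->backward(grad);
        for (size_t i = 0; i < grads.size(); ++i) {
            creator->inputs[i]->backward(grads[i]);
        }
    }
};

// z = a * b, with grad_a = g * b and grad_b = g * a.
struct MulSketch : OpSketch {
    vector<float> backward(float g) override {
        return {g * inputs[1]->data, g * inputs[0]->data};
    }
    static NodePtr apply(const NodePtr &a, const NodePtr &b) {
        auto op = make_shared<MulSketch>();
        op->inputs = {a, b};
        a->pending++;  // each use of an input adds one expected contribution
        b->pending++;
        auto out = make_shared<NodeSketch>();
        out->data = a->data * b->data;
        out->creator = op;
        return out;
    }
};
```

Because the output holds its creator through a shared_ptr and the creator holds its inputs the same way, intermediate nodes in this sketch are released as soon as the last handle to the output goes away.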
Child Functions
The autograd module provides several built-in differentiable function implementations that derive from the base kernelnet::autograd::Function class. Each derived function implements a static apply method that computes its output, returning a kernelnet::autograd::Variable, while saving its inputs and parameters for the backward pass. Their corresponding backward methods receive the gradient from the next layer and compute gradients for each input.
-
kernelnet::autograd::AddFunction
-
apply(const VarPtr &a, const VarPtr &b)
static VarPtr AddFunction::apply(const VarPtr &a, const VarPtr &b);
Computes z = a + b by using Tensor::broadcast_add. It increments the pending gradient count of each input if gradients are required, saves the inputs, and produces a new output variable with its creator set.
-
backward(const Tensor &grad_output)
virtual vector<Tensor> AddFunction::backward(const Tensor &grad_output);
Propagates the gradient by either duplicating the incoming gradient (if inputs are of equal size) or summing gradients along the broadcasted dimensions.
-
kernelnet::autograd::SubtractFunction
-
apply(const VarPtr &a, const VarPtr &b)
static VarPtr SubtractFunction::apply(const VarPtr &a, const VarPtr &b);
Computes the element-wise difference z = a - b and saves both input variables.
-
backward(const Tensor &grad_output)
virtual vector<Tensor> SubtractFunction::backward(const Tensor &grad_output);
Propagates the incoming gradient unchanged for the first input; for the second input, the gradient is multiplied by -1. Special handling is provided if the second input does not require gradients.
-
kernelnet::autograd::MultiplyFunction
-
apply(const VarPtr &a, const VarPtr &b)
static VarPtr MultiplyFunction::apply(const VarPtr &a, const VarPtr &b);
Computes the element-wise product z = a * b while saving both inputs.
-
backward(const Tensor &grad_output)
virtual vector<Tensor> MultiplyFunction::backward(const Tensor &grad_output);
Implements the chain rule by computing grad_a = grad_output * b and grad_b = grad_output * a.
-
kernelnet::autograd::MatMulFunction
-
apply(const VarPtr &a, const VarPtr &b, int M, int K, int N)
static VarPtr MatMulFunction::apply(const VarPtr &a, const VarPtr &b, int M, int K, int N);
Performs matrix multiplication between A (with shape M×K) and B (with shape K×N) via Tensor::matmul, and saves the inputs along with the matrix dimensions for the backward pass.
-
backward(const Tensor &grad_output)
virtual vector<Tensor> MatMulFunction::backward(const Tensor &grad_output);
Computes gradients using transposition: grad_a = grad_output × B^T and grad_b = A^T × grad_output.
-
kernelnet::autograd::SumFunction
-
apply(const VarPtr &input)
static VarPtr SumFunction::apply(const VarPtr &input);
Reduces all elements of the input tensor to a scalar sum. The input tensor is saved along with its size.
-
backward(const Tensor &grad_output)
virtual vector<Tensor> SumFunction::backward(const Tensor &grad_output);
Propagates the scalar gradient to every element of the input by replicating it, using a CUDA kernel (fill_kernel) if on GPU.
-
kernelnet::autograd::LogFunction
-
apply(const VarPtr &input)
static VarPtr LogFunction::apply(const VarPtr &input);
Applies the natural logarithm element-wise (adding a small epsilon for numerical stability) and saves the input.
-
backward(const Tensor &grad_output)
virtual vector<Tensor> LogFunction::backward(const Tensor &grad_output);
Computes the gradient using the derivative 1/(x + epsilon) multiplied element-wise with the incoming gradient.
-
kernelnet::autograd::MSEFunction
Note: This is an abstract class and cannot be instantiated directly. It provides a static helper function to construct the MSE loss computation using other differentiable operations.
-
apply(const VarPtr &prediction, const Tensor &target)
static VarPtr MSEFunction::apply(const VarPtr &prediction, const Tensor &target);
Computes the Mean Squared Error (MSE) loss by performing the following steps:
- Subtracts the target tensor from the prediction variable.
- Squares the element-wise difference.
- Sums all squared differences into a scalar.
- Scales the result by the reciprocal of the number of elements.
Returns a kernelnet::autograd::Variable representing the MSE loss. It internally builds a computation graph using standard differentiable functions, including kernelnet::autograd::SubtractFunction, kernelnet::autograd::MultiplyFunction, and kernelnet::autograd::SumFunction.
-
kernelnet::autograd::CrossEntropyLossFunction
Note: This is an abstract class and cannot be instantiated directly. It defines a static utility method that constructs a cross-entropy loss computation graph using standard autograd functions.
-
apply(const VarPtr &prediction, const Tensor &target, int num_classes)
static VarPtr CrossEntropyLossFunction::apply(const VarPtr &prediction, const Tensor &target, int num_classes);
Constructs the cross-entropy loss as a computation graph using the following steps:
- Applies a logarithm to the predictions using kernelnet::autograd::LogFunction.
- Multiplies the log-predictions with the target tensor (often one-hot encoded) using kernelnet::autograd::MultiplyFunction.
- Sums the result using kernelnet::autograd::SumFunction.
- Scales the summed loss:
  - Divides by batch size if num_classes > 0.
  - Otherwise, multiplies by -1.
The returned kernelnet::autograd::Variable contains a scalar value representing the cross-entropy loss. The actual gradient computation is handled by the components (kernelnet::autograd::LogFunction, kernelnet::autograd::MultiplyFunction, kernelnet::autograd::SumFunction) used to construct the graph.
-
kernelnet::autograd::SliceFunction
-
apply(const VarPtr &input, int batch_size, int start, int end)
static VarPtr SliceFunction::apply(const VarPtr &input, int batch_size, int start, int end);
Interprets the input tensor as a 2D array (shape: [batch_size, total_width]) and extracts columns in the interval [start, end). It saves the input and slicing parameters and returns a new variable containing the sliced tensor.
-
backward(const Tensor &grad_output)
virtual vector<Tensor> SliceFunction::backward(const Tensor &grad_output);
Maps the gradient from the sliced output back to the corresponding locations in the full input tensor, setting gradients outside the slice to zero.
-
kernelnet::autograd::ConcatFunction
-
apply(const vector<VarPtr> &inputs)
static VarPtr ConcatFunction::apply(const vector<VarPtr> &inputs);
Concatenates a list of input variables into a single output tensor by recording their individual sizes and copying data sequentially from each input.
-
backward(const Tensor &grad_output)
virtual vector<Tensor> ConcatFunction::backward(const Tensor &grad_output);
Splits the incoming gradient tensor into segments corresponding to each original input's size, returning a vector of gradient tensors.
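The transposition rule described for MatMulFunction::backward can be checked on plain row-major arrays. The sketch below is standalone and does not use KernelNet; matmul_sketch, transpose_sketch, and matmul_backward_sketch are hypothetical names that mirror the shapes stated above (A is M×K, B is K×N, grad_output is M×N).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>
using namespace std;

// Row-major product of an (M x K) and a (K x N) matrix stored as flat arrays.
vector<float> matmul_sketch(const vector<float> &a, const vector<float> &b, int M, int K, int N) {
    assert(a.size() == static_cast<size_t>(M) * K);
    assert(b.size() == static_cast<size_t>(K) * N);
    vector<float> c(static_cast<size_t>(M) * N, 0.0f);
    for (int i = 0; i < M; ++i)
        for (int k = 0; k < K; ++k)
            for (int j = 0; j < N; ++j)
                c[i * N + j] += a[i * K + k] * b[k * N + j];
    return c;
}

// Transposes a (rows x cols) row-major matrix into (cols x rows).
vector<float> transpose_sketch(const vector<float> &a, int rows, int cols) {
    assert(a.size() == static_cast<size_t>(rows) * cols);
    vector<float> t(a.size());
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c)
            t[c * rows + r] = a[r * cols + c];
    return t;
}

struct MatMulGradsSketch {
    vector<float> grad_a;  // shape M x K
    vector<float> grad_b;  // shape K x N
};

// grad_a = grad_output x B^T and grad_b = A^T x grad_output.
MatMulGradsSketch matmul_backward_sketch(const vector<float> &a, const vector<float> &b,
                                         const vector<float> &grad_output, int M, int K, int N) {
    MatMulGradsSketch g;
    g.grad_a = matmul_sketch(grad_output, transpose_sketch(b, K, N), M, N, K);
    g.grad_b = matmul_sketch(transpose_sketch(a, M, K), grad_output, K, M, N);
    return g;
}
```

Note how the inner dimension changes in each product: N for grad_a and M for grad_b, so each gradient comes out with the same shape as the input it belongs to.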
Example
The following snippet demonstrates using the Mean Squared Error (MSE) loss with KernelNet’s autograd system. Two tensors are created on the CPU (one for prediction and one for target), and the prediction is wrapped in a Variable that requires gradients. The MSE loss is computed, and an upstream gradient of 1 is used to backpropagate the gradients.
#include "kernelnet.hpp"
// Create prediction and target tensors (size 5) on CPU.
Tensor msePred(5, CPU);
Tensor mseTarget(5, CPU);
msePred.fill(2.0f);
mseTarget.fill(3.0f);
// Wrap the prediction in a Variable that tracks gradients.
auto varPred = make_shared<Variable>(msePred, true);
// Compute MSE loss: loss = mean((prediction - target)^2).
auto mseLoss = MSEFunction::apply(varPred, mseTarget);
// Create an upstream gradient (of ones) for the backward pass.
Tensor gradOutput(mseLoss->data.size(), CPU);
gradOutput.fill(1.0f);
// Perform backpropagation.
mseLoss->backward(gradOutput);
varPred->grad.print();
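As a sanity check on the values to expect from the example above, the MSE gradient can be worked out by hand. With loss = mean((prediction - target)^2), each gradient entry is 2 * (p_i - t_i) / n. For a prediction of 2, a target of 3, and n = 5, every entry would be 2 * (2 - 3) / 5 = -0.4, assuming the backward pass follows the listed steps. The standalone sketch below (mse_sketch and mse_grad_sketch are hypothetical names, not KernelNet functions) reproduces that arithmetic.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>
using namespace std;

// Mean of squared differences between two equally sized buffers.
float mse_sketch(const vector<float> &pred, const vector<float> &target) {
    assert(pred.size() == target.size() && !pred.empty());
    float total = 0.0f;
    for (size_t i = 0; i < pred.size(); ++i) {
        float diff = pred[i] - target[i];
        total += diff * diff;
    }
    return total / static_cast<float>(pred.size());
}

// Analytic gradient of the MSE with respect to each prediction:
// d(loss)/d(p_i) = 2 * (p_i - t_i) / n.
vector<float> mse_grad_sketch(const vector<float> &pred, const vector<float> &target) {
    assert(pred.size() == target.size() && !pred.empty());
    const float n = static_cast<float>(pred.size());
    vector<float> grad(pred.size());
    for (size_t i = 0; i < pred.size(); ++i) {
        grad[i] = 2.0f * (pred[i] - target[i]) / n;
    }
    return grad;
}
```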
kernelnet::nn::Module
The kernelnet::nn::Module
class defines the abstract base interface for all neural network layers or components in KernelNet. It standardizes how modules perform their forward computation, how their parameters are accessed for optimization, and how gradients are reset before each optimizer step.
Every custom layer (such as Dense, Conv2D, or LSTM layers) should inherit from kernelnet::nn::Module
and override the kernelnet::nn::Module::forward
method. Additionally, modules can override the parameters
method to expose their trainable parameters.
Methods
-
forward(const vector<VarPtr> &inputs)
virtual vector<VarPtr> Module::forward(const vector<VarPtr> &inputs) = 0;
Performs the forward computation of the module. Derived classes must override this method to define how input variables are transformed.
-
parameters()
virtual vector<VarPtr> Module::parameters();
Returns all learnable parameters of the module. By default, it returns an empty vector. Derived modules that have trainable parameters should override this method.
-
zero_grad()
virtual void Module::zero_grad();
Zeros out the gradients for all parameters in the module.
kernelnet::nn::SingleInputModule
The kernelnet::nn::SingleInputModule class inherits from kernelnet::nn::Module and serves as an abstract base for neural network layers or components that operate on exactly one input and produce exactly one output.
Methods
-
forward(const VarPtr &input)
virtual VarPtr SingleInputModule::forward(const VarPtr &input) = 0;
This abstract method must be implemented by any subclass.
-
forward(const vector<VarPtr> &inputs)
vector<VarPtr> SingleInputModule::forward(const vector<VarPtr> &inputs) override;
This method wraps the single-input forward function. It verifies that exactly one input is provided.
kernelnet::nn::Sequential
The kernelnet::nn::Sequential
container module provides a way to stack layers linearly. Inheriting from kernelnet::nn::SingleInputModule
, each submodule in a Sequential container accepts a single VarPtr
as input and returns a single VarPtr
as output.
Constructor
The kernelnet::nn::Sequential
container provides two constructors:
-
Default Constructor
Sequential::Sequential() : training(true) {}
Constructs an empty Sequential container, defaulting to training mode.
-
Parameterized Constructor
Sequential::Sequential(initializer_list<shared_ptr<SingleInputModule>> modules) : layers(modules), training(true) {}
Initializes the Sequential container with a list of shared pointers to kernelnet::nn::SingleInputModule instances. The provided modules are stored in order, and the container is set to training mode.
Methods
-
forward(const VarPtr &input)
virtual VarPtr Sequential::forward(const VarPtr &input) override;
Executes the forward pass by sequentially feeding the input through each submodule. It returns the final output variable.
-
parameters()
virtual vector<VarPtr> Sequential::parameters() override;
Returns a flattened list of all learnable parameters from each submodule.
-
train()
void Sequential::train();
Sets all contained submodules into training mode.
-
eval()
void Sequential::eval();
Sets all contained submodules to evaluation mode.
Example
Two layers are added to a Sequential container, and a forward pass is executed.
#include "kernelnet.hpp"
auto layer1 = make_shared<Dense>(input_dim, hidden_dim, device);
auto layer2 = make_shared<Dense>(hidden_dim, output_dim, device);
Sequential model = { layer1, layer2 };
VarPtr output = model.forward(input);
kernelnet::nn::Sigmoid
The kernelnet::nn::Sigmoid
module implements the sigmoid activation function, defined as:
sigmoid(x) = 1 / (1 + exp(-x))
It is implemented using an autograd-compatible function kernelnet::nn::SigmoidFunction
inherited from kernelnet::autograd::Function
that computes both the forward pass and the gradients during backpropagation. The user-facing kernelnet::nn::Sigmoid
module wraps this functionality so that it can be used as a layer in custom models.
Sigmoid Function
The kernelnet::nn::SigmoidFunction
class, inherited from kernelnet::autograd::Function
, provides the autograd-compatible implementation of the sigmoid activation.
In the forward pass, the input tensor is transformed into its sigmoid representation.
During the backward pass, it computes the gradient based on the derivative: y * (1 - y)
, where y
is the sigmoid output.
Methods:
-
apply(const VarPtr &input)
static VarPtr SigmoidFunction::apply(const VarPtr &input);
Applies sigmoid to the input variable and builds the autograd graph.
-
backward(const Tensor &grad_output)
vector<Tensor> SigmoidFunction::backward(const Tensor &grad_output);
Computes the gradient with respect to the input using the formula:
grad_output * y * (1 - y)
.
Sigmoid Module
The kernelnet::nn::Sigmoid
class is a user-facing module that inherits from kernelnet::nn::SingleInputModule
. It acts as a wrapper for the kernelnet::nn::SigmoidFunction
.
In its forward pass, the kernelnet::nn::Sigmoid
module calls kernelnet::nn::SigmoidFunction::apply
on the input variable.
Sigmoid::Sigmoid() {} // Constructor
VarPtr Sigmoid::forward(const VarPtr &input) {
return SigmoidFunction::apply(input);
}
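The forward and backward formulas for sigmoid can be tried out on plain buffers. The following is a minimal standalone sketch, not KernelNet's implementation; sigmoid_forward_sketch and sigmoid_backward_sketch are hypothetical names. As in the description above, the backward step takes the cached output y rather than the original input.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>
using namespace std;

// Element-wise sigmoid: y = 1 / (1 + exp(-x)).
vector<float> sigmoid_forward_sketch(const vector<float> &x) {
    vector<float> y(x.size());
    for (size_t i = 0; i < x.size(); ++i) {
        y[i] = 1.0f / (1.0f + exp(-x[i]));
    }
    return y;
}

// Gradient with respect to the input, computed from the cached output y:
// grad_input = grad_output * y * (1 - y).
vector<float> sigmoid_backward_sketch(const vector<float> &y, const vector<float> &grad_output) {
    assert(y.size() == grad_output.size());
    vector<float> grad_input(y.size());
    for (size_t i = 0; i < y.size(); ++i) {
        grad_input[i] = grad_output[i] * y[i] * (1.0f - y[i]);
    }
    return grad_input;
}
```

At x = 0 the output is 0.5 and the local derivative reaches its maximum of 0.25, which is one reason deep stacks of sigmoid layers tend to shrink gradients.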
kernelnet::nn::Softmax
The kernelnet::nn::Softmax
module implements the softmax activation function, which normalizes raw input scores into a probability distribution. It is defined as:
softmax(x_i) = exp(x_i) / sum_j exp(x_j)
The module is built upon an autograd-compatible function, kernelnet::nn::SoftmaxFunction
, which handles both the forward computation (applying softmax) and the backward computation (calculating gradients for backpropagation). The user-facing kernelnet::nn::Softmax
class wraps this functionality so that it can be used as a layer in custom models.
Softmax Function
The kernelnet::nn::SoftmaxFunction
class, inherited from kernelnet::autograd::Function
, provides the autograd-compatible implementation of the softmax activation. It computes the forward pass by normalizing input scores along the last dimension of each sample, and caches the output for use in the backward pass.
During the backward pass, it calculates the gradient with respect to the input using the formula:
dL/dx_i = y_i * (dL/dy_i - sum_j(dL/dy_j * y_j)),
where y = softmax(x).
Methods:
-
apply(const VarPtr &input, int batch_size, int num_classes)
static VarPtr SoftmaxFunction::apply(const VarPtr &input, int batch_size, int num_classes);
Applies softmax to the input variable with the specified batch_size and num_classes, constructs the autograd graph, and caches the softmax output for the backward pass.
-
backward(const Tensor &grad_output)
vector<Tensor> SoftmaxFunction::backward(const Tensor &grad_output);
Computes the gradient with respect to the input tensor using the formula: grad_input = y * (grad_output - sum(grad_output * y)), where y is the cached softmax output and the sum runs over the classes of each sample.
Softmax Module
The kernelnet::nn::Softmax
class is a user-facing module that inherits from kernelnet::nn::SingleInputModule
. It acts as a wrapper for kernelnet::nn::SoftmaxFunction
.
In its forward pass, the kernelnet::nn::Softmax
module invokes SoftmaxFunction::apply
, passing the input variable along with the defined batch_size
and num_classes
, to compute the softmax activation.
Softmax::Softmax(int batch_size, int num_classes)
: batch_size(batch_size), num_classes(num_classes) {}
VarPtr Softmax::forward(const VarPtr &input) {
return SoftmaxFunction::apply(input, batch_size, num_classes);
}
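The row-wise softmax and its gradient formula can be sketched on a flat (batch_size, num_classes) buffer. This is a standalone illustration, not KernelNet's kernel; softmax_forward_sketch and softmax_backward_sketch are hypothetical names. Subtracting the row maximum before exponentiating is a common numerical-stability step added here as an assumption; the source does not say whether the library does this.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>
using namespace std;

// Softmax over each row of a (batch_size, num_classes) row-major buffer.
vector<float> softmax_forward_sketch(const vector<float> &x, int batch_size, int num_classes) {
    assert(num_classes > 0 && x.size() == static_cast<size_t>(batch_size) * num_classes);
    vector<float> y(x.size());
    for (int b = 0; b < batch_size; ++b) {
        const float *row = &x[b * num_classes];
        float *out = &y[b * num_classes];
        // Assumption: shift by the row maximum so exp() cannot overflow.
        const float row_max = *max_element(row, row + num_classes);
        float total = 0.0f;
        for (int j = 0; j < num_classes; ++j) {
            out[j] = exp(row[j] - row_max);
            total += out[j];
        }
        for (int j = 0; j < num_classes; ++j) out[j] /= total;
    }
    return y;
}

// grad_input_i = y_i * (grad_output_i - sum_j(grad_output_j * y_j)), per row.
vector<float> softmax_backward_sketch(const vector<float> &y, const vector<float> &grad_output,
                                      int batch_size, int num_classes) {
    assert(y.size() == grad_output.size());
    assert(y.size() == static_cast<size_t>(batch_size) * num_classes);
    vector<float> grad_input(y.size());
    for (int b = 0; b < batch_size; ++b) {
        const int base = b * num_classes;
        float dot = 0.0f;
        for (int j = 0; j < num_classes; ++j) dot += grad_output[base + j] * y[base + j];
        for (int j = 0; j < num_classes; ++j) {
            grad_input[base + j] = y[base + j] * (grad_output[base + j] - dot);
        }
    }
    return grad_input;
}
```

Because each row of the softmax output sums to one, the gradients within a row sum to zero, which is a quick property to check when testing a backward implementation.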
kernelnet::nn::Tanh
The kernelnet::nn::Tanh
module implements the hyperbolic tangent (tanh) activation function, defined as:
tanh(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x))
This module integrates with the autograd system by leveraging the kernelnet::nn::TanhFunction
class, which handles both the forward computation and the gradient calculation needed during backpropagation.
Tanh Function
The kernelnet::nn::TanhFunction
class provides the autograd-compatible implementation of the tanh activation.
In the forward pass, the input tensor is transformed element-wise using:
tanh(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x))
.
The output is stored for later use in the backward pass.
During the backward pass, the gradient is computed with the derivative:
dL/dx = dL/dy * (1 - tanh(x)^2)
.
Methods:
-
apply(const VarPtr &input)
static VarPtr TanhFunction::apply(const VarPtr &input);
Applies the tanh function to the input variable and builds the autograd graph.
-
backward(const Tensor &grad_output)
vector<Tensor> TanhFunction::backward(const Tensor &grad_output);
Computes the gradient with respect to the input using the formula:
grad_output * (1 - tanh(x)^2)
.
Tanh Module
The kernelnet::nn::Tanh
class serves as the user-facing module that encapsulates the functionality of TanhFunction
, inheriting from kernelnet::nn::SingleInputModule
.
In its forward pass, the Tanh
module calls TanhFunction::apply
on the input variable to perform the tanh activation.
Tanh::Tanh() {} // Constructor
VarPtr Tanh::forward(const VarPtr &input) {
return TanhFunction::apply(input);
}
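One way to gain confidence in a backward formula such as 1 - tanh(x)^2 is to compare it against a central finite difference. The sketch below is standalone and uses doubles for the comparison; tanh_backward_sketch and tanh_numeric_grad_sketch are hypothetical helper names, not KernelNet functions.

```cpp
#include <cassert>
#include <cmath>
using namespace std;

// Analytic gradient: dL/dx = dL/dy * (1 - tanh(x)^2).
double tanh_backward_sketch(double x, double grad_output) {
    const double y = tanh(x);
    return grad_output * (1.0 - y * y);
}

// Central finite-difference estimate of d tanh(x) / dx.
double tanh_numeric_grad_sketch(double x, double eps = 1e-5) {
    assert(eps > 0.0);
    return (tanh(x + eps) - tanh(x - eps)) / (2.0 * eps);
}
```

For a few sample points the two values should agree to several decimal places; a larger gap usually points to a sign or caching mistake in the backward code.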
kernelnet::nn::ReLU
The kernelnet::nn::ReLU
module implements the Rectified Linear Unit (ReLU) activation function, defined as:
ReLU(x) = max(0, x)
This module integrates with the autograd system by utilizing the kernelnet::nn::ReLUFunction
class, which manages both the forward computation and the backward gradient propagation during backpropagation.
ReLU Function
The kernelnet::nn::ReLUFunction
class, inherited from kernelnet::autograd::Function
, provides an autograd-compatible implementation of the ReLU activation. It computes the forward pass by applying:
ReLU(x) = (x > 0) ? x : 0
element-wise on the input and caches the output for use in the backward pass.
For the backward pass, the gradient is computed only for the positive elements of the input using:
grad_in = grad_output * (x > 0 ? 1 : 0)
.
Methods:
-
apply(const VarPtr &input)
static VarPtr ReLUFunction::apply(const VarPtr &input);
Applies the ReLU activation function to the input variable, sets up the autograd graph, and stores both the input and output for gradient computation.
-
backward(const Tensor &grad_output)
vector<Tensor> ReLUFunction::backward(const Tensor &grad_output);
Computes the gradient with respect to the input using the condition:
grad_in = grad_output * ((input > 0) ? 1.0f : 0.0f)
.
ReLU Module
The kernelnet::nn::ReLU
class is a user-facing module that inherits from kernelnet::nn::SingleInputModule
and serves as a wrapper around ReLUFunction
.
In its forward pass, the module calls ReLUFunction::apply
on the input variable to compute the ReLU activation, with support for both CPU and CUDA computations.
ReLU::ReLU() {} // Constructor
VarPtr ReLU::forward(const VarPtr &input) {
return ReLUFunction::apply(input);
}
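The ReLU forward rule and its gradient mask can be illustrated on plain buffers. This is a minimal standalone sketch with hypothetical names (relu_forward_sketch, relu_backward_sketch), not the library's CPU or CUDA code. It follows the convention stated above that the gradient at x = 0 is zero.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>
using namespace std;

// Element-wise ReLU: y = (x > 0) ? x : 0.
vector<float> relu_forward_sketch(const vector<float> &x) {
    vector<float> y(x.size());
    for (size_t i = 0; i < x.size(); ++i) {
        y[i] = x[i] > 0.0f ? x[i] : 0.0f;
    }
    return y;
}

// Passes the upstream gradient through only where the input was positive.
vector<float> relu_backward_sketch(const vector<float> &x, const vector<float> &grad_output) {
    assert(x.size() == grad_output.size());
    vector<float> grad_input(x.size());
    for (size_t i = 0; i < x.size(); ++i) {
        grad_input[i] = x[i] > 0.0f ? grad_output[i] : 0.0f;
    }
    return grad_input;
}
```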
kernelnet::nn::Dense
The kernelnet::nn::Dense
layer is a fully connected (linear) neural network layer. It performs a linear transformation on input data by computing:
output = input × weight^T + bias
Here, the weight matrix has dimensions (output_dim × input_dim)
and the bias vector has dimensions (output_dim)
. The Dense layer automatically manages the gradient computations during backpropagation.
Constructor
The kernelnet::nn::Dense
layer, inherited from kernelnet::nn::SingleInputModule
, is constructed by providing:
- input_dim – the number of input features.
- output_dim – the number of neurons (output features).
- device – the device for tensor allocation (CPU or CUDA).
Dense::Dense(int input_dim, int output_dim, Device device)
: input_dim(input_dim), output_dim(output_dim) {}
Methods
-
forward(const VarPtr &input)
VarPtr Dense::forward(const VarPtr &input) override;
Executes the forward pass by performing a matrix multiplication between the input and the transposed weight matrix, then adds the bias (replicated across the batch).
-
parameters()
vector<VarPtr> Dense::parameters();
Returns a vector containing the learnable parameters (weight and bias).
Example
#include "kernelnet.hpp"
// Create a Dense layer with 128 input features and 64 output features on CUDA.
Dense denseLayer(128, 64, CUDA);
// Create an input tensor holding batch_size * input_dim values.
const int batch_size = 4; // example batch size, chosen for illustration
Tensor inputTensor(128 * batch_size, CPU);
inputTensor.fill(1.0f);
inputTensor.toCUDA();
// Wrap the input tensor in a Variable.
auto inputVar = make_shared<Variable>(inputTensor, true);
VarPtr outputVar = denseLayer.forward(inputVar);
outputVar->print();
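The computation output = input × weight^T + bias can be written out with explicit loops to show how the (output_dim × input_dim) weight layout is indexed. This is a standalone sketch on flat row-major buffers, not the Dense layer's actual code; dense_forward_sketch is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>
using namespace std;

// input:  (batch_size x input_dim), row-major
// weight: (output_dim x input_dim), row-major
// bias:   (output_dim), added to every row of the result
vector<float> dense_forward_sketch(const vector<float> &input, const vector<float> &weight,
                                   const vector<float> &bias, int batch_size, int input_dim,
                                   int output_dim) {
    assert(input.size() == static_cast<size_t>(batch_size) * input_dim);
    assert(weight.size() == static_cast<size_t>(output_dim) * input_dim);
    assert(bias.size() == static_cast<size_t>(output_dim));
    vector<float> output(static_cast<size_t>(batch_size) * output_dim);
    for (int b = 0; b < batch_size; ++b) {
        for (int o = 0; o < output_dim; ++o) {
            float acc = bias[o];
            for (int i = 0; i < input_dim; ++i) {
                // Row o of weight is dotted with row b of input, which is
                // what multiplying by the transposed weight amounts to.
                acc += input[b * input_dim + i] * weight[o * input_dim + i];
            }
            output[b * output_dim + o] = acc;
        }
    }
    return output;
}
```

With 128 input features, 64 output features, and a batch of 4, the result would hold 4 * 64 values.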
kernelnet::nn::Embedding
The kernelnet::nn::Embedding
layer converts token indices into dense embedding vectors using a learnable weight matrix.
Internally, the embedding lookup is performed by Embedding::EmbeddingLookupFunction, which inherits from kernelnet::autograd::Function. This autograd-compatible function caches token indices during the forward pass and accumulates gradients for the corresponding rows of the weight during the backward pass.
Constructor
The kernelnet::nn::Embedding
layer, inherited from kernelnet::nn::SingleInputModule
, is constructed by specifying:
- vocab_size: The number of tokens in the vocabulary.
- embed_dim: The dimensionality of the embedding vectors.
- dev (optional): The device on which to allocate the weight tensor (CPU or CUDA). Defaults to CPU.
Embedding::Embedding(int vocab_size, int embed_dim, Device dev = CPU)
: vocab_size(vocab_size), embed_dim(embed_dim) {}
Methods
-
forward(const VarPtr &input)
VarPtr Embedding::forward(const VarPtr &input) override;
Performs an embedding lookup using the input variable (which contains token indices) and returns the corresponding embedding vectors. Internally, it calls EmbeddingLookupFunction::apply.
-
parameters()
vector<VarPtr> Embedding::parameters() override;
Returns a vector containing the learnable embedding weight.
-
EmbeddingLookupFunction::apply()
static VarPtr EmbeddingLookupFunction::apply(const VarPtr &indices, const VarPtr &weight, int embed_dim);
Internally extracts token indices, performs a row lookup on the weight, and caches the indices for the backward pass.
-
EmbeddingLookupFunction::backward()
vector<Tensor> EmbeddingLookupFunction::backward(const Tensor &grad_output);
Computes gradients for the embedding weight by accumulating gradient slices for each token index.
Example
#include "kernelnet.hpp"
// Create an Embedding layer with a vocabulary of 1000 tokens and embeddings of size 128 on CUDA.
Embedding embeddingLayer(1000, 128, CUDA);
Tensor tokenIndices(10, CPU);
for (size_t i = 0; i < tokenIndices.size(); i++) {
tokenIndices.data()[i] = static_cast<float>(i % 1000);
}
tokenIndices.toCUDA();
// Wrap the token indices in a Variable.
auto indicesVar = make_shared<Variable>(tokenIndices, false);
VarPtr embeddings = embeddingLayer.forward(indicesVar);
embeddings->print();
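The row lookup and the gradient accumulation described for the embedding layer can be sketched on a flat (vocab_size, embed_dim) weight buffer. This is a standalone illustration with hypothetical names (embedding_lookup_sketch, embedding_backward_sketch). It takes the indices as integers for clarity, whereas the example above stores them as floats inside a Tensor.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>
using namespace std;

// Copies one weight row per token index into the output, in order.
vector<float> embedding_lookup_sketch(const vector<float> &weight, const vector<int> &indices,
                                      int embed_dim) {
    assert(embed_dim > 0 && weight.size() % static_cast<size_t>(embed_dim) == 0);
    const int vocab_size = static_cast<int>(weight.size() / embed_dim);
    vector<float> out(indices.size() * embed_dim);
    for (size_t t = 0; t < indices.size(); ++t) {
        assert(indices[t] >= 0 && indices[t] < vocab_size);
        for (int d = 0; d < embed_dim; ++d) {
            out[t * embed_dim + d] = weight[indices[t] * embed_dim + d];
        }
    }
    return out;
}

// Accumulates each output-gradient slice into the weight row it came from.
// A token that appears several times receives the sum of its slices.
vector<float> embedding_backward_sketch(const vector<float> &grad_output,
                                        const vector<int> &indices, int vocab_size,
                                        int embed_dim) {
    assert(grad_output.size() == indices.size() * static_cast<size_t>(embed_dim));
    vector<float> grad_weight(static_cast<size_t>(vocab_size) * embed_dim, 0.0f);
    for (size_t t = 0; t < indices.size(); ++t) {
        assert(indices[t] >= 0 && indices[t] < vocab_size);
        for (int d = 0; d < embed_dim; ++d) {
            grad_weight[indices[t] * embed_dim + d] += grad_output[t * embed_dim + d];
        }
    }
    return grad_weight;
}
```

Rows for tokens that never appear in the batch keep a zero gradient, so only a small part of the weight matrix changes on each step.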
kernelnet::nn::Conv2D
The kernelnet::nn::Conv2D
layer is a learnable 2D convolutional module designed for processing image data.
It applies a set of convolutional kernels (filters) to an input tensor (e.g., an image or feature map) and adds a bias, computing the operation:
output = conv2d(input, weight) + bias
Constructor
The kernelnet::nn::Conv2D
layer, inherited from kernelnet::nn::SingleInputModule
, is constructed by specifying:
- in_channels: The number of channels in the input.
- out_channels: The number of channels produced by the convolution.
- kernel_h and kernel_w: Height and width of the convolutional kernel.
- input_height and input_width: Dimensions of the input tensor.
- stride – convolution stride.
- padding – zero padding size.
-
device – target device (
CPU
or CUDA
).
Conv2D::Conv2D(int in_channels, int out_channels, int kernel_h, int kernel_w,
int input_height, int input_width, int stride, int padding, Device device)
: in_channels(in_channels), out_channels(out_channels),
kernel_h(kernel_h), kernel_w(kernel_w),
stride(stride), padding(padding),
input_height(input_height), input_width(input_width) {}
Methods
-
forward(const VarPtr &input)
VarPtr Conv2D::forward(const VarPtr &input) override;
Performs the forward pass by calling
Conv2DFunction::apply
, which computes the convolution using either CPU loops or CUDA kernels.
-
parameters()
vector<VarPtr> Conv2D::parameters() override;
Returns a vector containing the layer's learnable parameters: the weight and bias variables.
-
Conv2DFunction::apply()
static VarPtr Conv2DFunction::apply(const VarPtr &input, const VarPtr &weight, const VarPtr &bias, int in_channels, int input_height, int input_width, int out_channels, int kernel_h, int kernel_w, int stride, int padding);
This static method links the forward pass with the autograd system: it stores the input, weight, and bias, computes the output, and creates a Variable whose creator is set to the function for later gradient propagation.
-
Conv2DFunction::backward()
vector<Tensor> Conv2DFunction::backward(const Tensor &grad_output);
kernelnet::nn::Conv2DFunction
inherits from kernelnet::autograd::Function
. Computes gradients with respect to the input, weight, and bias using specialized CUDA kernels on GPU or loops on CPU.
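To make the CPU-loop path concrete, here is a minimal sketch of a direct convolution for one sample. It is not KernelNet's code: conv_out_dim and conv2d_forward are hypothetical names, and the memory layouts (input [in_c][h][w], weight [out_c][in_c][kh][kw]) are assumptions chosen for the illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Output extent along one axis: floor((in + 2*padding - kernel) / stride) + 1.
int conv_out_dim(int in, int kernel, int stride, int padding) {
    assert(stride > 0 && in + 2 * padding >= kernel);
    return (in + 2 * padding - kernel) / stride + 1;
}

// Naive convolution of one sample. Assumed row-major layouts:
//   input [in_c][h][w], weight [out_c][in_c][kh][kw], bias [out_c].
std::vector<float> conv2d_forward(const std::vector<float>& input,
                                  const std::vector<float>& weight,
                                  const std::vector<float>& bias,
                                  int in_c, int h, int w, int out_c,
                                  int kh, int kw, int stride, int padding) {
    const int oh = conv_out_dim(h, kh, stride, padding);
    const int ow = conv_out_dim(w, kw, stride, padding);
    std::vector<float> out(static_cast<std::size_t>(out_c) * oh * ow);
    for (int oc = 0; oc < out_c; ++oc)
        for (int oy = 0; oy < oh; ++oy)
            for (int ox = 0; ox < ow; ++ox) {
                float acc = bias[oc];
                for (int ic = 0; ic < in_c; ++ic)
                    for (int ky = 0; ky < kh; ++ky)
                        for (int kx = 0; kx < kw; ++kx) {
                            const int iy = oy * stride + ky - padding;
                            const int ix = ox * stride + kx - padding;
                            // Positions outside the image act as zero padding.
                            if (iy < 0 || iy >= h || ix < 0 || ix >= w) continue;
                            acc += input[(ic * h + iy) * w + ix] *
                                   weight[((oc * in_c + ic) * kh + ky) * kw + kx];
                        }
                out[(oc * oh + oy) * ow + ox] = acc;
            }
    return out;
}
```

With a 32x32 input, a 3x3 kernel, and stride 1, conv_out_dim gives 30 for padding 0 and 32 for padding 1, so padding 1 preserves the spatial size.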
Example
#include "kernelnet.hpp"
// Create a Conv2D layer with 3 input channels, 16 output channels,
// a 3x3 kernel, input dimensions 32x32, stride 1, and no padding on CUDA.
Conv2D convLayer(3, 16, 3, 3, 32, 32, 1, 0, CUDA);
// Create an input tensor for a batch of 10 samples.
Tensor inputTensor(10 * 3 * 32 * 32, CPU);
inputTensor.fill(1.0f);
inputTensor.toCUDA();
// Wrap the input tensor into a Variable.
auto inputVar = make_shared<Variable>(inputTensor, true);
VarPtr outputVar = convLayer.forward(inputVar);
outputVar->toCPU();
outputVar->print();
kernelnet::nn::MaxPool2D
The kernelnet::nn::MaxPool2D
layer performs spatial downsampling by applying a max pooling operation. It takes an input tensor with shape
[batch_size, channels, input_height, input_width]
and outputs a pooled tensor with reduced spatial dimensions.
During the forward pass, the maximum value within each pooling window is selected, and its index is stored for use in the backward pass.
Constructor
The kernelnet::nn::MaxPool2D
module, inherited from kernelnet::nn::SingleInputModule
, is initialized with the following parameters:
- kernel_size: The size of the square pooling window.
- stride: The stride for the pooling operation.
- batch_size: The number of samples in the batch.
- channels: The number of channels in the input tensor.
- input_height and input_width: The spatial dimensions of the input.
MaxPool2D::MaxPool2D(int kernel_size, int stride,
int batch_size, int channels,
int input_height, int input_width)
: kernel_size(kernel_size), stride(stride),
batch_size(batch_size), channels(channels),
input_height(input_height), input_width(input_width) {}
Methods
-
forward(const VarPtr &input)
VarPtr MaxPool2D::forward(const VarPtr &input) override;
Applies max pooling on the input variable. Internally, it calls
MaxPool2DFunction::apply
to execute the pooling operation and store the max indices.
-
MaxPool2DFunction::apply()
static VarPtr MaxPool2DFunction::apply(const VarPtr &input, int batch_size, int channels, int input_height, int input_width, int kernel_size, int stride);
kernelnet::nn::MaxPool2DFunction
inherits from kernelnet::autograd::Function
. Performs the forward pass for max pooling and caches the max indices required for the backward pass.
-
MaxPool2DFunction::backward()
vector<Tensor> MaxPool2DFunction::backward(const Tensor &grad_output);
Uses stored max indices to propagate gradients from the output back to the corresponding input elements.
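The interplay between the cached indices and the backward pass can be sketched as follows. This is an illustrative stand-in, not the library's code: PoolResult, maxpool2d_forward, and maxpool2d_backward are hypothetical names, and the input is assumed to be a row-major [planes][h][w] buffer with planes = batch_size * channels.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct PoolResult {
    std::vector<float> output;    // pooled values
    std::vector<int> max_indices; // flat input index of each selected maximum
};

// Forward: pick the maximum in every window and remember where it came from.
PoolResult maxpool2d_forward(const std::vector<float>& input, int planes,
                             int h, int w, int kernel_size, int stride) {
    assert(kernel_size > 0 && stride > 0 && h >= kernel_size && w >= kernel_size);
    assert(input.size() == static_cast<std::size_t>(planes) * h * w);
    const int oh = (h - kernel_size) / stride + 1;
    const int ow = (w - kernel_size) / stride + 1;
    PoolResult r;
    for (int p = 0; p < planes; ++p)
        for (int oy = 0; oy < oh; ++oy)
            for (int ox = 0; ox < ow; ++ox) {
                int best = (p * h + oy * stride) * w + ox * stride;
                for (int ky = 0; ky < kernel_size; ++ky)
                    for (int kx = 0; kx < kernel_size; ++kx) {
                        const int idx = (p * h + oy * stride + ky) * w + ox * stride + kx;
                        if (input[idx] > input[best]) best = idx;
                    }
                r.output.push_back(input[best]);
                r.max_indices.push_back(best);
            }
    return r;
}

// Backward: route each output gradient to the input position that won the max.
std::vector<float> maxpool2d_backward(const std::vector<float>& grad_output,
                                      const std::vector<int>& max_indices,
                                      std::size_t input_size) {
    assert(grad_output.size() == max_indices.size());
    std::vector<float> grad_input(input_size, 0.0f);
    for (std::size_t i = 0; i < grad_output.size(); ++i)
        grad_input[max_indices[i]] += grad_output[i];
    return grad_input;
}
```

Every input element that did not win its window receives a zero gradient, so the backward pass needs only the stored indices, not the input values.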
Example
#include "kernelnet.hpp"
// Create a MaxPool2D layer with a 2x2 pooling window and stride of 2, for a batch of 10 samples and 3 channels, with input size 32x32.
MaxPool2D maxPoolLayer(2, 2, 10, 3, 32, 32);
// Create an input tensor of shape (10 * 3 * 32 * 32).
Tensor inputTensor(10 * 3 * 32 * 32, CPU);
inputTensor.fill(1.0f);
auto inputVar = make_shared<Variable>(inputTensor, true);
VarPtr outputVar = maxPoolLayer.forward(inputVar);
outputVar->print();
kernelnet::nn::LSTMCell
The kernelnet::nn::LSTMCell
module implements a single time‐step of an LSTM.
It encapsulates learnable parameters—input‐to‐hidden and hidden‐to‐hidden weights and biases—
and internally invokes LSTMCellFunction::apply
to build the autograd graph.
On forward, it computes the four gates (input, forget, cell, output), updates the cell and hidden states,
and caches all intermediates required for backward.
Constructor
The kernelnet::nn::LSTMCell
module, inherited from kernelnet::nn::Module
, is initialized with the following parameters:
- input_dim: The size of input vector at each time step.
- hidden_dim: The size of hidden state.
- device:
CPU
or CUDA
.
LSTMCell::LSTMCell(int input_dim, int hidden_dim, Device device = CPU)
: input_dim(input_dim), hidden_dim(hidden_dim), device(device) {}
Methods
-
forward(const vector<VarPtr>& inputs)
vector<VarPtr> LSTMCell::forward(const vector<VarPtr> &inputs) override;
Expects
{input, h_prev, c_prev}
; returns {h_new, c_new}.
-
LSTMCellFunction::apply()
static pair<VarPtr,VarPtr> LSTMCellFunction::apply( const VarPtr &input, const VarPtr &h_prev, const VarPtr &c_prev, const VarPtr &weight_ih, const VarPtr &weight_hh, const VarPtr &bias_ih, const VarPtr &bias_hh, int input_dim, int hidden_dim);
Performs all gate computations and returns new states, registering a
Function
for backward.
-
LSTMCellFunction::backward()
vector<Tensor> LSTMCellFunction::backward(const Tensor &grad_output) override;
Given gradients for
h_new
and optionally c_new
, computes and returns gradients for {input, h_prev, c_prev, weight_ih, weight_hh, bias_ih, bias_hh}
.
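The gate arithmetic of one time step can be written out for a single sample. This is a sketch under stated assumptions rather than the library's kernel: lstm_cell_step is a hypothetical name, and the weights are assumed to stack the four gates row-wise in the order input, forget, cell, output.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

using Vec = std::vector<float>;

static float sigmoid(float x) { return 1.0f / (1.0f + std::exp(-x)); }

// One LSTM step for a single sample; returns {h_new, c_new}.
// Assumed layout: weight_ih is [4*hidden][input], weight_hh is [4*hidden][hidden],
// with gate rows stacked in the order input, forget, cell, output.
std::pair<Vec, Vec> lstm_cell_step(const Vec& x, const Vec& h_prev, const Vec& c_prev,
                                   const Vec& weight_ih, const Vec& weight_hh,
                                   const Vec& bias_ih, const Vec& bias_hh) {
    const std::size_t input_dim = x.size(), hidden_dim = h_prev.size();
    assert(c_prev.size() == hidden_dim);
    assert(weight_ih.size() == 4 * hidden_dim * input_dim);
    assert(weight_hh.size() == 4 * hidden_dim * hidden_dim);
    assert(bias_ih.size() == 4 * hidden_dim && bias_hh.size() == 4 * hidden_dim);

    // Pre-activations for all four gates: W_ih x + b_ih + W_hh h_prev + b_hh.
    Vec gates(4 * hidden_dim);
    for (std::size_t r = 0; r < 4 * hidden_dim; ++r) {
        float acc = bias_ih[r] + bias_hh[r];
        for (std::size_t k = 0; k < input_dim; ++k)
            acc += weight_ih[r * input_dim + k] * x[k];
        for (std::size_t k = 0; k < hidden_dim; ++k)
            acc += weight_hh[r * hidden_dim + k] * h_prev[k];
        gates[r] = acc;
    }

    Vec h_new(hidden_dim), c_new(hidden_dim);
    for (std::size_t j = 0; j < hidden_dim; ++j) {
        const float i = sigmoid(gates[j]);                    // input gate
        const float f = sigmoid(gates[hidden_dim + j]);       // forget gate
        const float g = std::tanh(gates[2 * hidden_dim + j]); // candidate cell
        const float o = sigmoid(gates[3 * hidden_dim + j]);   // output gate
        c_new[j] = f * c_prev[j] + i * g;
        h_new[j] = o * std::tanh(c_new[j]);
    }
    return {h_new, c_new};
}
```

The values i, f, g, o and tanh(c_new) are exactly the intermediates a backward pass would need, which is why a cell caches them during the forward step.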
Example
#include "kernelnet.hpp"
// Create cell for input_dim=16, hidden_dim=32 on CPU
LSTMCell lstm(16, 32, CPU);
// Prepare dummy inputs
Tensor x(10 * 16, CPU); x.fill(0.1f);
Tensor h(10 * 32, CPU); h.fill(0.0f);
Tensor c(10 * 32, CPU); c.fill(0.0f);
auto x_var = make_shared<Variable>(x, true);
auto h_var = make_shared<Variable>(h, true);
auto c_var = make_shared<Variable>(c, true);
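// forward() returns a vector, so unpack the two states by index.
vector<VarPtr> states = lstm.forward({x_var, h_var, c_var});
VarPtr h_new = states[0]; // updated hidden state
VarPtr c_new = states[1]; // updated cell state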
kernelnet::nn::LSTM
The kernelnet::nn::LSTM
module wraps a single kernelnet::nn::LSTMCell
to process an entire sequence of length sequence_length
.
Given a flattened input tensor of shape
[batch_size * sequence_length * input_dim]
, it slices out each time‐step, feeds it (along with the running hidden and cell state) into the internal kernelnet::nn::LSTMCell
,
and finally concatenates all hidden‐state outputs into one long tensor.
Constructor
The kernelnet::nn::LSTM
module, inherited from kernelnet::nn::SingleInputModule
, is initialized with the following parameters:
- batch_size: The number of sequences per batch.
- sequence_length: The number of time steps per sequence.
- input_dim: The size of the input vector at each time step.
- hidden_dim: The size of the hidden state.
- device:
CPU
or CUDA
.
LSTM::LSTM(int batch_size, int sequence_length, int input_dim, int hidden_dim, Device device)
: batch_size(batch_size), sequence_length(sequence_length), input_dim(input_dim), hidden_dim(hidden_dim), device(device), cell(input_dim, hidden_dim, device) {}
Internally constructs a single kernelnet::nn::LSTMCell
with the same input_dim
and hidden_dim
.
Methods
-
forward(const VarPtr &input)
virtual VarPtr LSTM::forward(const VarPtr &input) override;
- Expects input shaped [batch_size*sequence_length*input_dim].
- Initializes h0 and c0 to zero.
- Unrolls over each slice of length input_dim.
- Returns a single VarPtr containing all hidden states concatenated: shape [batch_size*sequence_length, hidden_dim].
-
parameters()
virtual vector<VarPtr> LSTM::parameters() override;
Delegates to the internal LSTMCell's parameters (weight_ih, weight_hh, bias_ih, bias_hh).
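Slicing one time step out of the flattened input can be sketched as below. The helper name slice_time_step is hypothetical, and the batch-major layout [batch][time][input_dim] is an assumption; the library may order the axes differently.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Extract time step `t` for every sequence in the batch.
// Assumption: `input` is row-major [batch_size][sequence_length][input_dim].
// Returns a [batch_size][input_dim] buffer suitable for one cell step.
std::vector<float> slice_time_step(const std::vector<float>& input,
                                   int batch_size, int sequence_length,
                                   int input_dim, int t) {
    assert(t >= 0 && t < sequence_length);
    assert(input.size() ==
           static_cast<std::size_t>(batch_size) * sequence_length * input_dim);
    std::vector<float> step(static_cast<std::size_t>(batch_size) * input_dim);
    for (int b = 0; b < batch_size; ++b)
        for (int k = 0; k < input_dim; ++k)
            step[b * input_dim + k] = input[(b * sequence_length + t) * input_dim + k];
    return step;
}
```

An unrolled forward pass would call such a helper for t = 0 .. sequence_length - 1, feed each slice together with the running states into the cell, and append every new hidden state to the output.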
Example
#include "kernelnet.hpp"
// Unrolling an LSTM over a 20‐step sequence of 8‐dim inputs and 16-dim outputs with batch size 4
LSTM lstm(4, 20, 8, 16, CPU);
Tensor seq_in(4 * 20 * 8, CPU);
seq_in.fill(0.5f);
auto seq_var = make_shared<Variable>(seq_in, true);
auto out = lstm.forward(seq_var);
kernelnet::optim::SGD
kernelnet::optim::SGD
implements stochastic gradient descent with optional per‑parameter gradient‑norm clipping.
On each step()
it optionally rescales any gradient whose ℓ₂‑norm exceeds clip_value
, then updates each parameter:
param = param - lr * grad
Constructor
- params: List of trainable variables
- lr: Learning rate
- clip_value: Maximum allowed gradient norm (0 = no clipping)
SGD(const vector<VarPtr>& params, float lr, float clip_value = 0.0f);
Methods
-
step()
void SGD::step();
Optionally clips each parameter’s gradient, then updates
param[i] -= lr * grad[i]
. -
zero_grad()
void SGD::zero_grad();
Sets all gradients to zero and resets their initialization flags.
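The clip-then-update rule can be sketched for a single parameter buffer. The names clip_by_norm and sgd_step are hypothetical, and the sketch assumes clipping is applied to each parameter's gradient independently, as the description above suggests.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Rescale `grad` in place so its L2 norm does not exceed clip_value.
// A clip_value of 0 (or less) disables clipping.
void clip_by_norm(std::vector<float>& grad, float clip_value) {
    if (clip_value <= 0.0f) return;
    float sq = 0.0f;
    for (float g : grad) sq += g * g;
    const float norm = std::sqrt(sq);
    if (norm > clip_value) {
        const float scale = clip_value / norm;
        for (float& g : grad) g *= scale;
    }
}

// One SGD update for a single parameter: param -= lr * grad.
void sgd_step(std::vector<float>& param, std::vector<float>& grad,
              float lr, float clip_value = 0.0f) {
    assert(param.size() == grad.size());
    clip_by_norm(grad, clip_value);
    for (std::size_t i = 0; i < param.size(); ++i)
        param[i] -= lr * grad[i];
}
```

For a gradient (3, 4) with clip_value 1, the norm 5 is scaled down to 1, giving (0.6, 0.8); the direction of the update is preserved and only its length changes.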
kernelnet::trainer::Trainer
The kernelnet::trainer::Trainer
class wraps a kernelnet::nn::Sequential
model, a
kernelnet::optim::SGD
optimizer, and a pluggable loss function inherited from kernelnet::autograd::Function
. Its
trainEpoch
method runs, for each sample:
- Forward through the model
- Compute loss via the provided LossFunction
- Backward pass to populate gradients
- Optimizer
step()
- Optimizer
zero_grad()
Constructor
- model: The customized kernelnet::nn::Sequential model
- optimizer: A kernelnet::optim::SGD instance
- loss_fn: Function inherited from kernelnet::autograd::Function, taking (prediction, target_tensor) → scalar loss
Trainer(const shared_ptr<Sequential>& model, const SGD& optimizer, LossFunction loss_fn = MSEFunction::apply);
Methods
-
trainEpoch(const vector<VarPtr>& inputs, const vector<VarPtr>& targets)
trainEpoch(const vector<VarPtr>& inputs, const vector<VarPtr>& targets);
Runs one epoch over the given input–target pairs, performing forward, loss, backward, update, and zero‑grad for each sample.
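The per-sample order of operations can be illustrated with a deliberately tiny stand-in. ScalarModel and train_epoch below are hypothetical and unrelated to the real Trainer class: the "model" is a single weight w predicting w * x, and the loss is the squared error.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model with one weight and its gradient accumulator.
struct ScalarModel {
    float w = 0.0f;
    float grad = 0.0f;
};

// Runs one epoch and returns the mean squared error seen during it.
float train_epoch(ScalarModel& m, const std::vector<float>& inputs,
                  const std::vector<float>& targets, float lr) {
    assert(inputs.size() == targets.size() && !inputs.empty());
    float total = 0.0f;
    for (std::size_t i = 0; i < inputs.size(); ++i) {
        const float pred = m.w * inputs[i];   // 1. forward
        const float err = pred - targets[i];
        total += err * err;                   // 2. loss (squared error)
        m.grad += 2.0f * err * inputs[i];     // 3. backward: d(err^2)/dw
        m.w -= lr * m.grad;                   // 4. optimizer step
        m.grad = 0.0f;                        // 5. zero_grad
    }
    return total / static_cast<float>(inputs.size());
}
```

Because gradients accumulate with +=, skipping the final zeroing step would let each sample's gradient leak into the next update, which is why the loop ends with zero_grad.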
kernelnet::data
The kernelnet::data
module provides utilities for loading and batching
datasets. Currently supported:
- CIFAR‑10 (image classification)
- Penn Treebank (PTB) (language modeling)
kernelnet::data::CIFAR10Dataset
Loads CIFAR‑10 binary files, normalizes images, and one‑hot encodes labels.
class CIFAR10Dataset {
public:
CIFAR10Dataset(const string &data_dir, bool train);
size_t size() const;
const CIFAR10Sample& getSample(size_t index) const;
};
kernelnet::data::CIFAR10DataLoader
Wraps a kernelnet::data::CIFAR10Dataset
for shuffled mini‑batch iteration.
class CIFAR10DataLoader {
public:
CIFAR10DataLoader(CIFAR10Dataset &dataset, int batch_size, bool shuffle=true);
void reset();
bool hasNext() const;
pair<Tensor,Tensor> nextBatch();
};
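The reset / hasNext / nextBatch protocol can be sketched over plain sample indices. IndexBatcher is a hypothetical class, not part of the library; it assumes the final batch may be smaller than batch_size when the dataset size is not a multiple of it, and it takes an explicit seed so runs are reproducible.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

class IndexBatcher {
public:
    IndexBatcher(std::size_t dataset_size, std::size_t batch_size,
                 bool shuffle, unsigned seed = 0)
        : indices_(dataset_size), batch_size_(batch_size),
          shuffle_(shuffle), rng_(seed) {
        assert(batch_size > 0);
        reset();
    }

    // Restore the index order (reshuffling if requested) and rewind.
    void reset() {
        std::iota(indices_.begin(), indices_.end(), std::size_t{0});
        if (shuffle_) std::shuffle(indices_.begin(), indices_.end(), rng_);
        current_ = 0;
    }

    bool hasNext() const { return current_ < indices_.size(); }

    // Return the next batch of sample indices; the last one may be short.
    std::vector<std::size_t> nextBatch() {
        assert(hasNext());
        const std::size_t end = std::min(current_ + batch_size_, indices_.size());
        std::vector<std::size_t> batch(indices_.begin() + current_,
                                       indices_.begin() + end);
        current_ = end;
        return batch;
    }

private:
    std::vector<std::size_t> indices_;
    std::size_t batch_size_;
    bool shuffle_;
    std::mt19937 rng_;
    std::size_t current_ = 0;
};
```

A loader built this way would gather the samples at the returned indices into one input tensor and one label tensor per batch. Calling reset() at the end of each epoch, as the training loops in this document do, yields a fresh shuffle for the next pass.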
kernelnet::data::PTBDataset
Loads a Penn Treebank text file, builds a word‐to‐index vocabulary, and slices the token stream into fixed‐length (input, target) sequence pairs.
struct PTBSample {
Tensor input; // length = sequence_length
Tensor target; // length = sequence_length
};
class PTBDataset {
public:
PTBDataset(const string &file, int sequence_length);
size_t size() const;
const PTBSample& getSample(size_t index) const;
private:
void loadFile(const string &filename);
void buildVocabulary(const vector<string> &tokens);
};
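Vocabulary building and sequence slicing can be sketched on standard containers. The names build_vocabulary, SequencePair, and slice_sequences are hypothetical. The sketch assumes ids are assigned in order of first appearance and that each target is its input shifted by one token, the usual language-modeling setup; the library's exact slicing may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Assign each distinct token the next free id, in order of first appearance.
std::unordered_map<std::string, int>
build_vocabulary(const std::vector<std::string>& tokens) {
    std::unordered_map<std::string, int> vocab;
    for (const std::string& t : tokens)
        vocab.emplace(t, static_cast<int>(vocab.size())); // no-op if present
    return vocab;
}

struct SequencePair {
    std::vector<int> input;  // tokens [start, start + L)
    std::vector<int> target; // tokens [start + 1, start + L + 1)
};

// Cut the id stream into non-overlapping windows of length L whose targets
// are the inputs shifted by one position (next-token prediction).
std::vector<SequencePair> slice_sequences(const std::vector<int>& ids,
                                          std::size_t sequence_length) {
    assert(sequence_length > 0);
    std::vector<SequencePair> samples;
    for (std::size_t start = 0; start + sequence_length + 1 <= ids.size();
         start += sequence_length) {
        SequencePair s;
        s.input.assign(ids.begin() + start, ids.begin() + start + sequence_length);
        s.target.assign(ids.begin() + start + 1,
                        ids.begin() + start + sequence_length + 1);
        samples.push_back(std::move(s));
    }
    return samples;
}
```

Trailing tokens that cannot fill a complete window plus its one-token lookahead are dropped in this sketch.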
kernelnet::data::PTBDataLoader
Wraps a kernelnet::data::PTBDataset
for shuffled mini‑batch retrieval of sequence samples.
class PTBDataLoader {
public:
PTBDataLoader(PTBDataset &dataset, int batch_size, bool shuffle=true);
void reset();
bool hasNext() const;
pair<Tensor,Tensor> nextBatch();
private:
PTBDataset &dataset;
int batch_size;
bool shuffle;
vector<size_t> indices; // element type assumed; it was lost in the original listing
size_t current_index;
};
Example
Using the kernelnet::data
utilities to benchmark training and evaluation on the CIFAR-10 dataset:
#include "kernelnet.hpp"
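#include <chrono>
using namespace std::chrono; // brings high_resolution_clock, duration_cast, milliseconds into scope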
void runCIFAR10Tests() {
Device dev = CPU;
int batch_size = 256;
int num_epochs = 100;
int num_classes = 10;
int image_height = 32, image_width = 32, in_channels = 3;
// --- Load CIFAR-10 Data ---
CIFAR10Dataset trainDataset("data/cifar10/train", true);
CIFAR10Dataset testDataset("data/cifar10/test", false);
CIFAR10DataLoader trainLoader(trainDataset, batch_size, true);
CIFAR10DataLoader testLoader(testDataset, batch_size, false);
// --- Define Model Architecture ---
// conv1: input channels 3 → 16, output dims remain 32×32
auto conv1 = make_shared<Conv2D>(in_channels, 16, 3, 3, image_height, image_width, 1, 1, dev);
auto pool1 = make_shared<MaxPool2D>(2, 2, batch_size, 16, image_height, image_width); // → 16×16
auto conv2 = make_shared<Conv2D>(16, 32, 3, 3, image_height / 2, image_width / 2, 1, 1, dev);
auto pool2 = make_shared<MaxPool2D>(2, 2, batch_size, 32, image_height / 2, image_width / 2); // → 8×8
auto dense = make_shared<Dense>(32 * 8 * 8, num_classes, dev);
auto softmax = make_shared<Softmax>(batch_size, num_classes);
// Assemble model into a Sequential container
shared_ptr<Sequential> model = make_shared<Sequential>(initializer_list<shared_ptr<SingleInputModule>>{
conv1, pool1, conv2, pool2, dense, softmax});
// --- Set Optimizer ---
vector<VarPtr> params = model->parameters();
float learning_rate = 0.01f;
SGD optimizer(params, learning_rate);
// --- Define Loss Function ---
LossFunction loss_fn = [num_classes](const VarPtr &prediction, const Tensor &target) {
return CrossEntropyLossFunction::apply(prediction, target, num_classes);
};
// --- Create Trainer ---
Trainer trainer(model, optimizer, loss_fn);
// --- Training Loop ---
auto start = high_resolution_clock::now();
for (int epoch = 0; epoch < num_epochs; epoch++) {
float epoch_loss = 0.0f;
int batches = 0;
while (trainLoader.hasNext()) {
auto batch = trainLoader.nextBatch();
if (dev == CUDA) {
batch.first.toCUDA();
batch.second.toCUDA();
}
VarPtr input_var = make_shared<Variable>(batch.first, false, "input_batch");
VarPtr target_var = make_shared<Variable>(batch.second, false, "target_batch");
vector<VarPtr> inputs = {input_var};
vector<VarPtr> targets = {target_var};
trainer.trainEpoch(inputs, targets);
VarPtr prediction = model->forward(input_var);
VarPtr loss = loss_fn(prediction, batch.second);
float batch_loss = loss->data.sum();
epoch_loss += batch_loss;
batches++;
}
trainLoader.reset();
cout << "Epoch " << epoch << " Average Loss: " << epoch_loss / batches << endl;
}
auto end = high_resolution_clock::now();
auto duration = duration_cast<milliseconds>(end - start).count();
cout << "Custom architecture training completed in " << (duration / 1000.0) << " seconds" << endl;
// --- Evaluate on Test Set ---
int correct = 0, total = 0;
while (testLoader.hasNext()) {
auto batch = testLoader.nextBatch();
if (dev == CUDA) {
batch.first.toCUDA();
batch.second.toCUDA();
}
VarPtr input_var = make_shared<Variable>(batch.first, false, "test_input");
VarPtr prediction = model->forward(input_var);
vector<int> pred_labels = prediction->data.argmax(1, num_classes);
vector<int> true_labels = batch.second.argmax(1, num_classes);
for (size_t i = 0; i < pred_labels.size(); ++i) {
if (pred_labels[i] == true_labels[i])
correct++;
total++;
}
}
cout << "KernelNet Test Accuracy: " << (100.0 * correct / total) << "%" << endl;
}
Using the kernelnet::data
utilities to benchmark training and evaluation on the PTB dataset:
#include "kernelnet.hpp"
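#include <chrono>
#include <cmath> // exp
using namespace std::chrono; // brings high_resolution_clock, duration_cast, milliseconds into scope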
// Helper function: converts a tensor of token indices to a one-hot encoded tensor
inline Tensor onehot(const Tensor &indices, int num_classes);
int runPTBTests() {
// --- Hyperparameters and Device Setup ---
Device dev = CPU;
int batch_size = 8;
int num_epochs = 20;
int sequence_length = 35;
int embed_dim = 128;
int hidden_dim = 256;
// --- Load PTB Data ---
PTBDataset trainDataset("data/ptb/ptb.train.txt", sequence_length);
PTBDataset testDataset("data/ptb/ptb.test.txt", sequence_length);
PTBDataLoader trainLoader(trainDataset, batch_size, true);
PTBDataLoader testLoader(testDataset, batch_size, false);
int vocab_size = trainDataset.vocab_size;
// --- Build Model ---
// Model architecture: Embedding → LSTM (unrolled) → Dense → Softmax.
auto embedding = make_shared<Embedding>(vocab_size, embed_dim, dev);
auto lstm = make_shared<LSTM>(batch_size, sequence_length, embed_dim, hidden_dim, dev);
auto dense = make_shared<Dense>(hidden_dim, vocab_size, dev);
auto softmax = make_shared<Softmax>(batch_size * sequence_length, vocab_size);
// Assemble the model using the kernelnet::nn::Sequential container.
auto model = make_shared<Sequential>(initializer_list<shared_ptr<SingleInputModule>>{
embedding,
lstm,
dense,
softmax});
// --- Setup Optimizer ---
vector<VarPtr> params = model->parameters();
float learning_rate = 0.01f;
SGD optimizer(params, learning_rate);
// --- Define Loss Function Lambda ---
LossFunction loss_fn = [vocab_size](const VarPtr &prediction, const Tensor &target) {
Tensor onehot_target = onehot(target, vocab_size);
return CrossEntropyLossFunction::apply(prediction, onehot_target, vocab_size);
};
// --- Create Trainer ---
// Trainer accepts model, optimizer, and the loss function.
Trainer trainer(model, optimizer, loss_fn);
// --- Training Loop ---
auto start = high_resolution_clock::now();
for (int epoch = 0; epoch < num_epochs; epoch++) {
float epoch_loss = 0.0f;
int batches = 0;
while (trainLoader.hasNext()) {
auto batch = trainLoader.nextBatch();
if (dev == CUDA) {
batch.first.toCUDA();
batch.second.toCUDA();
}
// Wrap the input and target batches into Variables.
VarPtr input_var = make_shared<Variable>(batch.first, true, "input_batch");
VarPtr target_var = make_shared<Variable>(batch.second, false, "target_batch");
// Trainer.trainEpoch() takes a vector of input Variables and a vector of target Variables.
vector<VarPtr> inputs = {input_var};
vector<VarPtr> targets = {target_var};
trainer.trainEpoch(inputs, targets);
// For logging, compute loss separately:
VarPtr prediction = model->forward(input_var);
VarPtr loss = loss_fn(prediction, batch.second);
float batch_loss = loss->data.sum();
epoch_loss += batch_loss;
batches++;
}
trainLoader.reset();
cout << "Epoch " << epoch << " Average Loss: " << (epoch_loss / batches) << endl;
}
auto end = high_resolution_clock::now();
auto duration = duration_cast<milliseconds>(end - start).count();
cout << "LSTM training completed in " << (duration / 1000.0) << " seconds." << endl;
// --- Evaluation: Compute Perplexity on Test Set ---
float total_loss = 0.0f;
int total_tokens = 0;
while (testLoader.hasNext()) {
auto batch = testLoader.nextBatch();
if (dev == CUDA) {
batch.first.toCUDA();
batch.second.toCUDA();
}
VarPtr input_var = make_shared<Variable>(batch.first, false, "test_input");
VarPtr prediction = model->forward(input_var);
VarPtr loss = loss_fn(prediction, batch.second);
total_loss += loss->data.sum();
total_tokens += batch.second.size();
}
float avg_loss = total_loss / total_tokens;
float perplexity = exp(avg_loss);
cout << "Test Perplexity: " << perplexity << endl;
return 0; // runPTBTests() is declared to return int
}
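The PTB example declares onehot without showing a body and computes perplexity inline. The sketch below illustrates both on std::vector instead of Tensor, so it is a stand-in rather than the helper the example links against. The perplexity function here is a hypothetical addition that assumes probs holds one softmax row per token, stored row-major as [n][num_classes].

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// One-hot encode class indices into a row-major [n][num_classes] buffer.
std::vector<float> onehot(const std::vector<int>& indices, int num_classes) {
    assert(num_classes > 0);
    std::vector<float> out(indices.size() * num_classes, 0.0f);
    for (std::size_t i = 0; i < indices.size(); ++i) {
        assert(indices[i] >= 0 && indices[i] < num_classes);
        out[i * num_classes + indices[i]] = 1.0f;
    }
    return out;
}

// Perplexity = exp(mean negative log-likelihood of the target tokens).
float perplexity(const std::vector<float>& probs,
                 const std::vector<int>& targets, int num_classes) {
    assert(!targets.empty());
    assert(probs.size() == targets.size() * static_cast<std::size_t>(num_classes));
    double nll = 0.0;
    for (std::size_t i = 0; i < targets.size(); ++i) {
        assert(targets[i] >= 0 && targets[i] < num_classes);
        nll -= std::log(static_cast<double>(probs[i * num_classes + targets[i]]));
    }
    return static_cast<float>(std::exp(nll / static_cast<double>(targets.size())));
}
```

A model that predicts a uniform distribution over V classes has perplexity V, which gives a useful baseline: a trained language model should score well below its vocabulary size. The division by the token count only yields a per-token average if the accumulated loss is a sum over tokens rather than a per-batch mean.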