Abstract
We describe the syntax of the widely popularnumDeriv
package and show how the functions from the pnd package
recognise and handle the parameters related to numerical difference
computation. We draw parallels between the two, examine the differences,
and provide recommendations for new users.
Load the packages first:
library(pnd)
#> Parallel numerical derivatives v.0.0.10 (2025-04-02).
#> 24 physical cores for parallelism through mclapply forking are available on Linux.
library(numDeriv)Numerical derivatives are ubiquitous in applies statistics and
numerical methods, which is why many packages rely on streamlined
implementations of numerical derivatives as dependencies. The popular
numDeriv package by Paul Gilbert and Ravi Varadhan has 314
reverse dependencies as of October 2024. Its flexibility allowed other
packages to achieve better results by providing reasonable defaults and
the interface for fine-tuning that matters in numerical optimisation and
statistical inference. However, it has no parallel capabilities, which
is why it takes a lot of time to calculate numerical gradients with its
grad function.
This vignette showcases the similarities and differences between the
features of numDeriv and pnd for a smooth and
painless transition to the new package.
numDeriv::gradThe implementation of numDeriv::grad() is clever: it
allows both vectorised and non-vectorised functions to be provided as
the input argument. Since gradients are defined for scalar functions, it
has to check whether the function of interest is scalar- or vector
values. This can be non-trivial with vectorised functions: although
\(\sin x\) is as scalar function,
sin(1:100) will return a vector in R. Therefore,
numDeriv::grad() has to determine if the
implementation of \(f\) is
\(\mathbb{R}^n \mapsto \mathbb{R}^n\)
(scalar / vectorised scalar), or \(\mathbb{R}^n \mapsto \mathbb{R}\) (scalar
multivariate function), or neither. Hence, functions like \(\mathbb{R} \mapsto \mathbb{R}^n\) or \(\mathbb{R}^n \mapsto \mathbb{R}^m\) are not
considered valid inputs.
Internally, numDeriv::grad(f, x0, ...) computes
f0 = f(x0, ...) and checks the length of f0.
For the input argument x0 with length(x0) = n,
the allowed lengths of f0 are either 1 (scalar
output) or n (vectorised output). Expressions such as
grad(function(x) c(sin(x), cos(x)), x = 1) return an error,
implying that for vector-valued function, the user might want to compute
a Jacobian. Similarly, numDeriv::hessian() checks the
dimensions of f0 = f(x0, ...) and only allows
non-vectorised functions with length(f0) == 1. Finally, the
function numDeriv::genD computes the first- and
second-derivative information (without cross-derivatives).
This check is not without its drawbacks: it may miscalculate function
dimensions. The implementation in pnd::Grad() handles the
edge cases that cause wrong outputs in
numDeriv::grad().
The pnd package provides similar functions:
Grad() is a drop-in replacement for
numDeriv::grad(), Hessian() replaces
numDeriv::hessian(), Jacobian() subsumes
numDeriv::jacobian(), and GenD(), whilst not
corresponding to numDeriv::genD(), is the real workhorse
that computes arbitrary derivatives of arbitrary accuracy order.
Example 1: correct dimensionality check results for vector-valued functions.
f2 <- function(x) c(sin(x), cos(x))  # Vector output -> gradient is unsupported
grad(f2, x = 1:4)
#> Error in grad.default(f2, x = 1:4): grad assumes a scalar valued function.
hessian(f2, x = 1:4)
#> Error in hessian.default(f2, x = 1:4): Richardson method for hessian assumes a scalar valued function.
Grad(f2, x = 1:4)
#> Error in Grad(f2, x = 1:4): Use 'Jacobian()' instead of 'Grad()' for vector-valued functions to obtain a matrix of derivatives.This check correctly identifies non-vectorised functions as well:
f2 <- function(x) c(sum(sin(x)), sum(cos(x)))
grad(f2, x = 1:4)
#> Error in grad.default(f2, x = 1:4): grad assumes a scalar valued function.
hessian(f2, x = 1:4)
#> Error in hessian.default(f2, x = 1:4): Richardson method for hessian assumes a scalar valued function.
jacobian(f2, x = 1:4)
#>            [,1]       [,2]       [,3]       [,4]
#> [1,]  0.5403023 -0.4161468 -0.9899925 -0.6536436
#> [2,] -0.8414710 -0.9092974 -0.1411200  0.7568025
Grad(f2, x = 1:4)
#> Error in Grad(f2, x = 1:4): Use 'Jacobian()' instead of 'Grad()' for vector-valued functions to obtain a matrix of derivatives.
Jacobian(f2, x = 1:4)
#>            [,1]       [,2]       [,3]       [,4]
#> [1,]  0.5403023 -0.4161468 -0.9899925 -0.6536436
#> [2,] -0.8414710 -0.9092974 -0.1411200  0.7568025
#> attr(,"step.size")
#> [1] 6.055454e-06 1.211091e-05 1.816636e-05 2.422182e-05
#> attr(,"step.size.method")
#> [1] "default"Example 2: valid input, invalid output in
numDeriv.
If a function consists of individually vectorised components and
returns an output that differs in length from the input, then,
numDeriv::jacobian may return wrong results. Specifically,
the output should be stacked coordinate-wise gradients, but is made of
stacked diagonal matrices instead.
f2 <- function(x) c(sin(x), cos(x))
grad(f2, x = 1:4)
#> Error in grad.default(f2, x = 1:4): grad assumes a scalar valued function.
jacobian(f2, x = 1:4)
#>            [,1]       [,2]       [,3]       [,4]
#> [1,]  0.5403023  0.0000000  0.0000000  0.0000000
#> [2,]  0.0000000 -0.4161468  0.0000000  0.0000000
#> [3,]  0.0000000  0.0000000 -0.9899925  0.0000000
#> [4,]  0.0000000  0.0000000  0.0000000 -0.6536436
#> [5,] -0.8414710  0.0000000  0.0000000  0.0000000
#> [6,]  0.0000000 -0.9092974  0.0000000  0.0000000
#> [7,]  0.0000000  0.0000000 -0.1411200  0.0000000
#> [8,]  0.0000000  0.0000000  0.0000000  0.7568025There are 3 methods implemented in numDeriv for gradient
calculation: ‘simple’, ‘Richardson’, and ‘complex’. For simplicity, the
formulæ given here assume scalar arguments; for vector arguments, these
calculations are done coordinate by coordinate.
method = "simple", numDeriv::grad()
computes a fast one-sided difference with a fixed step size (the default
is 0.0001): \(f'_{\mathrm{simple}} :=
\frac{f(x + h) - f(x)}{h}\) or \(\frac{f(x) - f(x-h)}{h}\).method = "Richardson", calls a routine for Richardson
extrapolation (described above).method = "complex" is valid only for those functions
for which the function may take complex inputs and handle complex
outputs. This limits the scope of this method to well-defined convenient
mathematical functions; this excludes the majority of applications in
statistics, biometrics, economics, and data science. However, if there
is such a function, then,
numDeriv::grad(..., method = "complex") will return \(\Im[f(x + \mathrm{i} h)]/h\).This syntax of numDeriv::grad makes consistent step size
selection a chore because of the differences in method implementations.
If method = "simple", the step size is
method.args$eps (defaults to \(10^{-4}\)), and it is absolute. If
method = "Richardson", the step size is the initial step
size in the Richardson extrapolation, and is computed as follows:
Technical remarks.
In numDeriv, the default step size – arguably the most
important tuning parameter of numerical differences – is equal to
d*x + (abs(x)<zero.tol) * eps, where
= 1e-4, d = 1e-4,
zero.tol = sqrt(.Machine$double.eps/7e-7) (approximately
1.8e-5).
eps is used instead of
d for elements of x which are zero’, but the
implementation is different: if x is just below
zero.tol, the actual step is not eps but
d*zero.tol + eps.x transitions from being below just
zero.tol to being just above it, the step size changes from
d*zero.tol + eps to d*zero.tol, i.e. it
decreases, which creates a discontinuity and a counter-intuitive
result.The second point can be illustrated with an example where the step
size for x = 1.7e-5 exceeds that for
x = 1.8e-5, which makes little sense mathematically:
f <- function(x) {cat(x, " "); sin(x)}
grad(f, 1.7e-5, method.args = list(r = 2))  # step 1.0e-4
#> 1.7e-05  0.0001170017  -8.30017e-05  6.700085e-05  -3.300085e-05
#> [1] 1
grad(f, 1.8e-5, method.args = list(r = 2))  # step 1.8e-9
#> 1.8e-05  1.80018e-05  1.79982e-05  1.80009e-05  1.79991e-05
#> [1] 1In pnd, the default step size is decided depending on
the value of x:
x < zero.tol, it is constant and equal to
zero.tol, where the revised value of zero.tol
is sqrt(.Machine$epsilon), or 1.5e-8;x > 1, the step size is equal to x
times the machine epsilon to the appropriate power (inverse of the sum
of derivative and accuracy orders);zero.tol <= x <= 1, the step size is
interpolated exponentially between zero.tol and
macheps^(1 / (deriv.order + acc.order)).The chosen value of zero.tol is reasonable in many
applications in biostatistics and econometrics where non-negativity of
certain parameters in optimisation is often achieved via box constraints
with a comparable distance from the boundary.
The default step size as a function of x is illustrated
below.
xseq <- 10^seq(-10, 2, 0.25)
sseq2 <- stepx(xseq, acc.order = 2)
sseq4 <- stepx(xseq, acc.order = 4)
matplot(xseq, cbind(sseq2, sseq4), lty = 1:2, col = 1, type = "l", bty = "n",
        log = "xy", main = "Default step size", xlab = "Point", ylab = "Step")
legend("topleft", c("2nd-order accurate", "4th-order accurate"), lty = 1:2)A detailed description of the Richardson extrapolation is given in Ridders (1982). To dissect the internals, it suffices to make the function print its input arguments.
f <- function(x) {cat(x, " "); sin(x)}
x0 <- 1
g1 <- numDeriv::grad(f, x0)
#> 1  1.0001  0.9999  1.00005  0.99995  1.000025  0.999975  1.000012  0.9999875
print(g1)
#> [1] 0.5403023
cat("Auto-detected step:", step.SW(sin, x0)$par, "\n")
#> Auto-detected step: 5e-06
hgrid <- 10^seq(-10, -4, 1/32)
errors <- sapply(hgrid, function(h) Grad(sin, x0, h = h, cores = 1,
            elementwise = TRUE, vectorised = TRUE, multivalued = FALSE)) - cos(x0)
plot(hgrid, abs(errors), log = "xy", cex = 0.6)Even with 4 Richardson iterations, the default final step size is
1.25e-5, which is larger than the optimal step size, which
is less than 1e-5 (visible from the plot), the empirically
determined step size 5e-6, or even the rule-of-thumb value
MachEps^(1/3). The equivalence between
numDeriv and pnd implementations of the
gradient (the latter being more general) can be established by computing
the optimal weights in the finite difference for the points in the
iterations of the Richardson extrapolation.
In this example, the default sequence of steps is \(\{h_i\}_{i=1}^4 = 10^{-4} / \{1, 2, 4,
8\}\), and the resulting stencil is \(\{\pm h_i\}_{i=1}^4\). The respective
weights can be calculated via fdCoef():
b <- fdCoef(stencil = c(-(2^(3:0)), 2^(0:3)))
print(b)
#> $stencil
#> [1] -8 -4 -2 -1  1  2  4  8
#> 
#> $weights
#>          x-8h          x-4h          x-2h          x-1h          x+1h 
#>  2.204586e-05 -3.703704e-03  1.185185e-01 -7.223986e-01  7.223986e-01 
#>          x+2h          x+4h          x+8h 
#> -1.185185e-01  3.703704e-03 -2.204586e-05 
#> 
#> attr(,"remainder.coef")
#> [1] -0.01128748
#> attr(,"accuracy.order")
#> requested effective 
#>        NA         8 
#> attr(,"expansion")
#> [1] "f' - 1.1287e-02 f^(9) + ..."Here, the weights on the outermost points – the first step of the
iteration with \(h_1\) – are minuscule
(2.2e-5). The relative importance of each iteration for
this specific function at this specific point can be found from the
equation \[\begin{multline*}
f'_{\mathrm{Rich,8}} = \sum_{i=1}^4 w_{i}[f(x + h_i) - f(x + h_i)],
\end{multline*}\] where \(\{w_i\}_{i=1}^4 \approx \{+0.608, -0.0997,
+0.0031, -0.00002\}\). These values are available as a length
analytical expression with denominator 45360, but in this case, they
were calculated numerically:
fd <- sin(x0 + b$stencil / 8 * 1e-4) * b$weights
abs(fd[1:4]) / sum(abs(fd[1:4]))
#>         x-8h         x-4h         x-2h         x-1h 
#> 2.609937e-05 4.384834e-03 1.403170e-01 8.552721e-01The absolute contribution of each term is proportional to specific terms of a certain polynomial. Practically, it implies that the weights of the summation terms further away from the point of interest decay exponentially (up to a certain constant), and similar accuracy can be achieved with fewer evaluations. In this example, 85.5% of the sum is defined by the finite difference with \(h_4\), and 14.0% with \(h_3\). Therefore, the following function is twice as cheap, and the difference between the estimated derivative values is negligible.
g2 <- Grad(f, x0, h = 1.25e-05, acc.order = 4, report = 0,
           elementwise = TRUE, vectorised = TRUE, multivalued = FALSE)
#> 0.999975  0.9999875  1.000012  1.000025
print(g2)
#> [1] 0.5403023
c(diff = g1 - g2, Error8 = cos(x0) - g1, Error4 = cos(x0) - g2)
#>          diff        Error8        Error4 
#> -2.377543e-12  4.636624e-12  2.259082e-12In this example, the approximation errors from the Richardson
extrapolation and a much cheaper weighted sum have the same order of
magnitude. By not-so-unlikely coincidence, the actual error of
Grad is less than that of grad, which is due
to the unpredictable numerical error that is not dominated by the
bounded higher-order derivatives truncated in the 4th-order-accurate
derivative.
Choosing a large initial step and subsequent shrinking are wasteful because similar, if not practically equivalent, accuracy can be achieved with twice as few evaluations with a reasonably chosen step size. If the function is to be evaluated many times, selecting the optimal step size via a data-driven procedure not only saves time whilst attaining comparable accuracy but also provides opportunities for speed-ups via parallel evaluation of the function on a grid. Richardson extrapolation, on the other hand, is typically computed in a loop, and since its fully parallelised implementation is fully subsumed by the weighted-sum approach, the new package does not dedicate any special numerical routine to this particular case.
Finally, the choice of an optimal non-uniform evaluation grid for a specific function at a specific point remains an open question. One short answer would be ‘the higher the curvature, the smaller the step’, which is the principle used in plotting software to trace the most accurate paths with the fewest evaluations. Another answer comes from the field of optimal polynomial approximation and involves such solutions as Chebyshev polynomials, Padé approximants, and most importantly, the Remez algorithm. Applying these methods to numerical derivatives requires careful handling of differences because the goal is approximating an unknown function that is evaluated with a rounding error, therefore, at least an arbitrary-precision numerical tool must be used in evaluating the derivatives.
The term ‘extrapolation group’ can be found, e.g., in Lindfield and Penny (1989), where it is used in
sections on numerical integration and differentiation and describes
sub-intervals on which polynomials of different degrees are fitted to
improve approximation accuracy. Computer programmes in the 1980s were
often optimised for memory use and size, not necessarily for ease of
interpretation or transparency of their correspondence to the
theoretical relationships that they were translating into numbers.
Therefore, a reader might get confused by the terms ‘extrapolation
group’, ‘improvement iteration’, and ‘accuracy order’. The
pnd package aims to provide more straightforward diagnostic
information.
Suppose that one wants to compute the derivative of \(f(x) = x^7\) at \(x_0 = 1\) – therefore, the true derivative
value is \(f'(1) = 6\). The
following code effectively debugs numDeriv::grad() and
shows how many values are computed (we choose the initial step size
\(h=2^{-10} = 1/1024\) to avoid any
representation error of \(x_0+h\)):
f <- function(x) {print(x, digits = 16); x^9} 
fp1 <- numDeriv::grad(f, x = 1, method.args = list(r = 4, d = 2^-10, show.details = TRUE))
#> [1] 1
#> [1] 1.0009765625
#> [1] 0.9990234375
#> [1] 1.00048828125
#> [1] 0.99951171875
#> [1] 1.000244140625
#> [1] 0.999755859375
#> [1] 1.0001220703125
#> [1] 0.9998779296875
#> 
#>  first order approximations 
#>               [,1]
#> [1,] 9.00008010876
#> [2,] 9.00002002717
#> [3,] 9.00000500679
#> [4,] 9.00000125170
#> 
#>  Richarson improvement group No.  1 
#>               [,1]
#> [1,] 8.99999999997
#> [2,] 9.00000000000
#> [3,] 9.00000000000
#> 
#>  Richarson improvement group No.  2 
#>      [,1]
#> [1,]    9
#> [2,]    9
print(fp1, digits = 16)
#> [1] 9.000000000000064In total, the function was called 9 times: 1 for the initial
dimensionality check and 2 times per each of the 4 iterations (since ).
In this implementation, the terminology differs from the textbook that
it is referring to: numDeriv’s ‘first-order approximations’
correspond to Lindfield & Penny’s ‘Group 1’, ‘Richarsdon improvement
group No. 1’ to LP’s ‘Group 2’, and the final output returned by
grad() would be ‘Group 4’.
In central differences, \(f(x_0)\)
is not used; with 8 symmetric evaluations, the truncation error of the
result is \(O(h^8)\).
Therefore, the standard textbook results about the optimal step
size for central first differences being proportional to \(\epsilon_{\mathrm{mach}}^{1/3}\)
are not applicable to numDeriv::grad()
because the default accuracy order is 8 and the step size of the
order \(\epsilon_{\mathrm{mach}}^{1/3}\) is
appropriate in this case.
With pnd, there is no need to re-define the function to
save its computed values because both the grid and the function values
are returned as the attribute.
# f <- function(x) x^9
# fp2 <- pnd::Grad(f, x = 1, h = "SW", acc.order = 8, vectorised1d = TRUE, report = 2)
# print(attributes(fp2)$step.search$iterations, digits = 16)This highlights an important difference between the two packages:
numDeriv::grad() starts at a relatively large step size
\(h = 10^{-4} \cdot \max(|x_0|, 1)\)
and repeatedly shrinks it, yielding an exponentially decreasing sequence
\(h, h / v, h / v^2, \ldots\) using the
same reduction factor \(v\) to attain
the desired accuracy order \(a = 2v\) –
slow due to many evaluations, but rather accurate;pnd::Grad() starts at the rule-of-thumb step size of
the correct order \(\epsilon_{\text{mach}}^{1/(1+a)}\). By
default, it computes central derivatives \(f'_{\mathrm{CD}}(x, h)\) with accuracy
order \(a = 2\) – fast due to
fewer evaluations, smaller default step size that agrees with numerical
analysis literature, estimates of the approximation error are
provided. If accuracy order \(a >
2\) is requested, then, the evaluation grid is linear: \(x \pm h, x \pm 2h, x \pm 3h, \ldots\).Finally, the method argument show.details = TRUE is
ignored for Hessians in numDeriv:
g <- function(x) sum(sin(x))
hessian(g, 1:3, method.args = list(show.details = TRUE))
#>               [,1]          [,2]          [,3]
#> [1,] -8.414710e-01  4.072455e-14 -2.864747e-13
#> [2,]  4.072455e-14 -9.092974e-01  1.358524e-14
#> [3,] -2.864747e-13  1.358524e-14 -1.411200e-01This is due to the fact that the hessian.default()
method calls genD(), but the latter never checks if
method.args$show.details is TRUE in the loops
where the extrapolation is carried out. This silent operation mode was
probably implemented to avoid output verbosity since there are many
elements in matrices. Nevertheless, any user who has not looked at the
source code of hessian() would be puzzled by this
unexpected behaviour because the manual of ?hessian
explicitly refers to ?grad and says that
method.args is passed to grad.