It's going to be convenient for each function that can generate a
MBEDTLS_ERR_ECP_IN_PROGRESS on its own (as opposed to just passing it around)
to have its own restart context that they can allocate and free as needed
independently of the restart context of other functions.
For example ecp_muladd() is going to have its own restart_muladd context that
in can managed, then when it calls ecp_mul() this will manage a restart_mul
context without interfering with the caller's context.
So, things need to be renames to avoid future name clashes.
From a user's perspective, you want a "basic operation" to take approximately
the same amount of time regardless of the curve size, especially since max_ops
is a global setting: otherwise if you pick a limit suitable for P-384 then
when you do an operation on P-256 it will return way more often than needed.
Said otherwise, a user is actually interested in actual running time, and we
do the API in terms of "basic ops" for practical reasons (no timers) but then
we should make sure it's a good proxy for running time.
Ok, so the original plan was to make mpi_inv_mod() the smallest block that
could not be divided. Updated plan is that the smallest block will be either:
- ecp_normalize_jac_many() (one mpi_inv_mod() + a number or mpi_mul_mpi()s)
- or the second loop in ecp_precompute_comb()
With default settings, the minimum non-restartable sequence is:
- for P-256: 222M
- for P-384: 341M
This is within a 2-3x factor of originally planned value of 120M. However,
that value can be approached, at the cost of some performance, by setting
ECP_WINDOW_SIZE (w below) lower than the default of 6. For example:
- w=4 -> 166M for any curve (perf. impact < 10%)
- w=2 -> 130M for any curve (perf. impact ~ 30%)
My opinion is that the current state with w=4 is a good compromise, and the
code complexity need to attain 120M is not warranted by the 1.4 factor between
that and the current minimum with w=4 (which is close to optimal perf).
This is the easy part: with the current steps, all information between steps
is passed via T which is already saved. Next we'll need to split at least the
first loop, and maybe calls to normalize_jac_many() and/or the second loop.
Separating main computation from filling of the auxiliary array makes things
clearer and easier to restart as we don't have to remember the in-progress
auxiliary array.
Previously there were only two states:
- T unallocated
- T allocated and valid
Now there are three:
- T unallocated
- T allocated and in progress
- T allocated and valid
Introduce new bool T_ok to distinguish the last two states.
Free it as soon as it's no longer needed, but as a backup free it in
ecp_group_free(), in case ecp_mul() is not called again after returning
ECP_IN_PROGRESS.
So far we only remember it when it's fully computed, next step is to be able
to compute it in multiple steps.
In case of argument change, freeing everything is not the most efficient
(wastes one free()+calloc()) but makes the code simpler, which is probably
more important here
We'll need to store MPIs and other things that allocate memory in this
context, so we need a place to free it. We can't rely on doing it before
returning from ecp_mul() as we might return MBEDTLS_ERR_ECP_IN_PROGRESS (thus
preserving the context) and never be called again (for example, TLS handshake
aborted for another reason). So, ecp_group_free() looks like a good place to
do this, if the restart context is part of struct ecp_group.
This means it's not possible to use the same ecp_group structure in different
threads concurrently, but:
- that's already the case (and documented) for other reasons
- this feature is precisely intended for environments that lack threading
An alternative option would be for the caller to have to allocate/free the
restart context and pass it explicitly, but this means creating new functions
that take a context argument, and putting a burden on the user.
The plan is to count basic operations as follows:
- call to ecp_add_mixed() -> 11
- call to ecp_double_jac() -> 8
- call to mpi_mul_mpi() -> 1
- call to mpi_inv_mod() -> 120
- everything else -> not counted
The counts for ecp_add_mixed() and ecp_double_jac() are based on the actual
number of calls to mpi_mul_mpi() they they make.
The count for mpi_inv_mod() is based on timing measurements on K64F and
LPC1768 boards, and are consistent with the usual very rough estimate of one
inversion = 100 multiplications. It could be useful to repeat that measurement
on a Cortex-M0 board as those have smaller divider and multipliers, so the
result could be a bit different but should be the same order of magnitude.
The documented limitation of 120 basic ops is due to the calls to mpi_inv_mod()
which are currently not interruptible nor planned to be so far.
This is the first step towards making verify_chain() iterative. While from a
readability point of view the current recursive version is fine, one of the
goals of this refactoring is to prepare for restartable ECC integration, which
will need the explicit stack anyway.