lots of small changes to the thesis.

author: Dennis Brentjes <d.brentjes@gmail.com> 2017-05-28 18:34:44 +0200
committer: Dennis Brentjes <d.brentjes@gmail.com> 2017-05-28 18:35:08 +0200
commit: 610b3f96ec31ee6192d46767dedae9d9efaedf9b (patch)
tree: 8568071a7b8df54d1e64c9ded9852143263372fd /content/implementation.tex
parent: 6ff78ac5b7b36ada3028d2d5380fa3dbe35bbd66 (diff)
download: thesis-610b3f96ec31ee6192d46767dedae9d9efaedf9b.tar.gz
thesis-610b3f96ec31ee6192d46767dedae9d9efaedf9b.tar.bz2
thesis-610b3f96ec31ee6192d46767dedae9d9efaedf9b.zip
1 files changed, 25 insertions, 13 deletions
diff --git a/content/implementation.tex b/content/implementation.tex
index 76b7275..3188d65 100644
--- a/content/implementation.tex
+++ b/content/implementation.tex
@@ -1,34 +1,46 @@
-\section{Elgamal in Cyclic Group and Elliptic Curve}
+\section{Implementation}
 
-The goal of this research is to see how differently these two ways of using Elgamal in this crypto scheme effect things like latency and troughput in CMix. But also doing this in as reusable way possible so that the implementation is of use for further research.
+A large part of this research is implementing the protocol in such a way that it:
 
-This means I will be trying to find cryptographic libraries that do most of the work while I focus on getting a uniform API over different backends and the correct implementation of said backends. Unfortunately elgamal is not a very popular encryption algorithm in modern crypto libraries. and even if it is supported with a high level interface it is very rigid and wouldn't allow for any secret sharing as needed in CMix. Because of this I needed access to lower level cryptographic primitives. Which I found in the libgcrypt library. The next step is to create my own CMix library in ``C'' which abstracts the lower lever crypto primitives for both ``cyclic group'' and ``elliptic curve'' elgamal. This makes it easy to switch between implementations and test without changing any of the application code. This library is written in ``C'' to make interfacing with it from other languages used in the application level easier. For the application level code I used ``C++'', but in theory this should be easily swapped for a different application language.
+\begin{itemize}
+	\item Supports multiple but subtly different cryptographic back-ends.
+	\item Is debuggable and reusable.
+	\item Allows back-ends to be comparably benchmarked.
+\end{itemize}
 
-However using libgcrypt makes it possible to implement CMix, it doesn't mean it was easily done. The library still lacks some convenience functions that needed workarounds in the eventual implementation. This is especially the case for the elliptic curve backend. Some specific problems will be discussed later.
+The following sections discuss implementation-specific details: how I achieved those goals and which parts might need attention in future research. For more information on where to find the implementation see \ref{app:impl}.
+
+\subsection{ElGamal in Cyclic Group and Elliptic Curve}
+
+The goal of this research is to see how these two ways of using ElGamal in this cryptographic scheme affect things like latency and throughput in \cmix, and to do this in as reusable a way as possible so that the implementation is of use for further research.
+
+This means I tried to find cryptographic libraries that do most of the work, while I focused on a uniform API over the different back-ends and on the correct implementation of those back-ends. Unfortunately, ElGamal is not a very popular encryption algorithm in modern cryptographic libraries, and even where it is supported through a high-level interface, that interface is very rigid and does not allow for the secret sharing needed in \cmix. Because of this I needed access to lower-level cryptographic primitives, which I found in the libgcrypt library. The next step was to create my own \cmix library in ``C'' which abstracts the lower-level cryptographic primitives for both ``cyclic group'' and ``elliptic curve'' ElGamal. This makes it easy to switch between implementations and test without changing any of the application code. The library is written in ``C'' to make it easier to interface with from other languages used at the application level. For the application-level code I used ``C++'', but in theory this could easily be swapped for a different application language.
+
+Although using libgcrypt makes it possible to implement \cmix, that does not mean it was easy. The library still lacks some convenience functions, which required workarounds in the eventual implementation. This is especially the case for the elliptic curve back-end. Some specific problems will be discussed later.
 
 \subsection{Wire format}
-For cmix to work, we need to send group elements of the cryptographic algorithms from one node to the next. For cyclic group this is easy. They are just big integers with a max size and can be sent as arrays of a fixed length.
+For \cmix to work, we need to send group elements of the cryptographic algorithms from one node to the next. For the cyclic group this is easy: the elements are just big integers with a maximum size and can be sent as arrays of a fixed length.
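The fixed-length encoding of cyclic group elements can be sketched as follows. This is a minimal illustration, not the thesis implementation: the helper names, the 256 byte width (matching 2048 bit group elements) and the big-endian byte order are assumptions made for the example.

```python
# Hypothetical sketch: fixed-length wire encoding of a cyclic group element.
# ELEMENT_BYTES and the big-endian byte order are assumptions for illustration.
ELEMENT_BYTES = 256  # 2048 bits


def serialize_element(element: int) -> bytes:
    """Encode a group element as a fixed-length byte array."""
    if element < 0 or element.bit_length() > ELEMENT_BYTES * 8:
        raise ValueError("element does not fit in the fixed-length array")
    return element.to_bytes(ELEMENT_BYTES, "big")


def deserialize_element(data: bytes) -> int:
    """Decode a fixed-length byte array back into a group element."""
    if len(data) != ELEMENT_BYTES:
        raise ValueError("unexpected array length")
    return int.from_bytes(data, "big")
```

Because every element occupies the same number of bytes, a receiving node never needs a length prefix to split a batch of elements.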
 
 When serializing elliptic curve points, there are several choices. You can send both affine coordinates. In the case of Ed25519 you can send only the $y$ coordinate, which means you need to calculate one of the corresponding $x$ coordinates when de-serializing. Or you can send the projective coordinates. In this implementation I chose to send both the affine $x$ and $y$ coordinates. This forces each node to calculate the inverse of the internally used projective $Z$ coordinate, which is relatively slow.
 
 The main reason for this is that the raw affine $x$ and $y$ coordinates are easier to inspect and debug. On top of that, libgcrypt is not very forthcoming with documentation of its internal structures, especially in its elliptic curve library. This means I was not sure how to properly serialize the projective coordinates of a point. So for now, keep in mind that the overall implementation could be a bit faster if these unnecessary inversions of $Z$ were eliminated.
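To illustrate where the inversion cost comes from, here is a minimal sketch of converting projective coordinates to affine ones. It assumes plain homogeneous coordinates $(X : Y : Z)$ with $x = X/Z$ and $y = Y/Z$ over the Ed25519 field; the representation libgcrypt uses internally may differ, and the function names are hypothetical.

```python
# Hypothetical sketch: projective -> affine conversion over the Ed25519 field.
# Assumes homogeneous coordinates (X : Y : Z) with x = X/Z and y = Y/Z.
P = 2**255 - 19  # Ed25519 field prime


def to_projective(x: int, y: int, z: int = 1) -> tuple:
    """Scale affine (x, y) by a nonzero z to obtain one projective representation."""
    return (x * z) % P, (y * z) % P, z % P


def to_affine(X: int, Y: int, Z: int) -> tuple:
    """Recover affine (x, y); the modular inversion of Z is the expensive step."""
    z_inv = pow(Z, P - 2, P)  # inversion via Fermat's little theorem
    return (X * z_inv) % P, (Y * z_inv) % P
```

The sketch only shows the field arithmetic and does not check that the coordinates lie on the curve. Sending $(X, Y, Z)$ directly would avoid the call to `pow` at the price of one extra field element per point on the wire.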
 
-So as to not worry network level things I used Google protobuf to serialize these messages further. This also means that you could have mutliple nodes which are implemented in different languages interact with each other.
+So as not to worry about network-level details, I used Google protobuf to serialize these messages further. This also means that multiple nodes implemented in different languages could interact with each other.
 
 \subsection{Mapping strings to Ed25519 points}
 
-There is however the Koblitz's method\cite{Koblitz} This is a probabilistic method of finding a curve point. it starts by choosing a number that will be used as a  ``stride''. The string you want to encode as a point has to be interpreted as a group element and multiplied by the stride. The result also has to be a group element so your message space is made smaller by a factor ``stride''. Now you can check if your result has a corresponding x coordinate. If it has one you are done, if it doesn't you add $1$ to your current trial y coordinate and check again. This continues up until your trial y coordinate reaches the value of  $first\_trial\_y\_coordinate + stride$. If you haven't found a suitable y coordinate by now this mapping algorithm fails.
+There is, however, Koblitz's method\cite{Koblitz}. This is a probabilistic way of finding a curve point for a given integer in such a way that the mapping is reversible. It starts by choosing a number that will be used as a ``stride''. The string you want to encode as a point is interpreted as a group element and multiplied by the stride. The result of this multiplication must also be a group element, which means your message space is made smaller by a factor ``stride''. The result is used as the first trial y coordinate, and you check whether it has a corresponding x coordinate. If it has one you are done; if it doesn't, you add $1$ to your current trial y coordinate and check again. This continues until your trial y coordinate reaches the value of $first\_trial\_y\_coordinate + stride$. If you haven't found a suitable y coordinate by then, the mapping algorithm fails.
 
 You can map a point back to the original string by dividing the y coordinate by the ``stride''. This works because the division is an integer division and less than the stride was added to the multiplied value, so the added offset is discarded as the remainder.
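The mapping and its inverse can be sketched as follows. This is an illustration under stated assumptions, not the thesis code: the function names are hypothetical, the string is interpreted as a little-endian integer of at most 31 bytes, the stride is fixed at 32, and the check only tests whether a trial $y$ has a corresponding $x$ on the Ed25519 curve $-x^2 + y^2 = 1 + d x^2 y^2$. It does not check membership of the prime-order subgroup.

```python
# Hypothetical sketch of Koblitz's method on Ed25519 (names are illustrative).
P = 2**255 - 19                               # field prime
D = (-121665 * pow(121666, P - 2, P)) % P     # Ed25519 curve constant d
STRIDE = 32                                   # assumed stride


def has_x(y: int) -> bool:
    """True if some x satisfies -x^2 + y^2 = 1 + d*x^2*y^2 for this y."""
    u = (y * y - 1) % P
    v = (D * y * y + 1) % P
    x_squared = (u * pow(v, P - 2, P)) % P
    # Euler's criterion: x_squared must be zero or a quadratic residue.
    return x_squared == 0 or pow(x_squared, (P - 1) // 2, P) == 1


def encode_message(message: bytes) -> int:
    """Map a message of at most 31 bytes to a y coordinate with a valid x."""
    if len(message) > 31:
        raise ValueError("message does not fit in 31 bytes")
    first_trial_y = int.from_bytes(message, "little") * STRIDE
    for y in range(first_trial_y, first_trial_y + STRIDE):
        if has_x(y):
            return y
    raise ValueError("no suitable y coordinate within the stride")


def decode_point_y(y: int, length: int = 31) -> bytes:
    """Invert the mapping: integer division discards the added offset."""
    return (y // STRIDE).to_bytes(length, "little")
```

For example, `decode_point_y(encode_message(b"hello"))` returns the original bytes padded with zero bytes up to 31 bytes, provided a suitable $y$ was found within the stride.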
 
-The problem with this probabilistic Koblitz's method is choosing your ``stride''. There is no way of knowing what the max distance between two consecutive suitable y coordinates could be, half of the possible group elements would be suitable, but it is impossible to list all the Ed25519 curve points and check. This makes it an unsolvable problem, but we can make educated guesses and stay on the safe side that guess.
+The problem with this probabilistic Koblitz's method is choosing your ``stride''. There is no way of knowing what the maximum distance between two consecutive suitable y coordinates could be. We do know that about half of the possible group elements are suitable in Ed25519, but it is infeasible to list all the Ed25519 curve points and check. This makes it an unsolvable problem in practice, but we can make an educated guess and stay on the safe side of that guess.
 
-Now to address the concern that you divide your message space by your stride. This theoretically also effects your throughput. This only affects you if you would optimally pack your messages in the 252 bit message space you have available in Ed25519. However if you only use the lower 248 bits, which gives you 31 byte message space which saves time and effort packing your messages optimally. you have 5 bits to use as a stride. Which, anecdotally, seems to be enough. Ofcourse you can never know if this stride will work for all possible messages. But for the purpose of this benchmark it seems to work well enough. Maybe it is possible to find out how stride influences the chance of not finding a suitable y coordinate. But that is outside of the scope of this research.
+Now to address the concern that you divide your message space by your stride. In theory this affects your throughput, but only if you would otherwise pack your messages optimally into the 252 bit message space available in Ed25519. If you only use the lower 248 bits, which gives you a 31 byte message space, you have 5 bits to use as a stride, which, anecdotally, seems to be enough. Of course you can never know if this stride will work for all possible messages, but for the purpose of this benchmark it seems to work well enough. It may be possible to find out how the stride influences the chance of not finding a suitable y coordinate, but that is outside the scope of this research.
 
-A couple of small but nice benefits of using $32$ as stride. Multiplication and division are bit shifts as $32$ is a power of $2$. For debugging purposes you might want to consider a stride of $16$. This makes any hexadecimal representation of a number just shift up one character. However in actual runs of the algorithm with a stride of $16$ some runs would fail because there was no suitable $y$ coordinate within $16$ values of the message value. This has not yet happened for $32$ yet. To reiterate this is no guarantee that it will never happen.
+Using $2^5 = 32$ as stride has a couple of small but nice benefits. Multiplication and division become bit shifts, as $32$ is a power of $2$. For debugging purposes you might want to consider a stride of $16$, which makes the hexadecimal representation of a number simply shift up one character. However, in actual runs of the algorithm a stride of $16$ was insufficient: some runs failed because there was no suitable $y$ coordinate within a range of $16$ values. This has not yet happened for $32$. To reiterate, this is no guarantee that it will never happen.
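Both properties are easy to check on a small number. The following sketch uses an arbitrary example value and an arbitrary offset below the stride; neither is taken from the implementation.

```python
# Illustrative check of the power-of-two stride properties (values are arbitrary).
message_value = 0xABC123

# A stride of 32 = 2**5: multiplying is a left shift by 5 bits.
trial_y = message_value << 5
assert trial_y == message_value * 32

# Any offset smaller than the stride disappears again when shifting back.
assert (trial_y + 17) >> 5 == message_value

# A stride of 16 = 2**4: the hexadecimal digits just move up one position.
shifted_hex = hex(message_value * 16)
assert shifted_hex == hex(message_value) + "0"
```

With a stride of $16$ the message stays readable in a hex dump of the $y$ coordinate, and only the last digit changes while searching for a suitable coordinate.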
 
-\subsection{Debugging the cmix operations.}
+\subsection{Debugging the \cmix operations}
 
-Debugging the cmix operations is hard. Intermediate results produced by nodes are hard to validate for the real 2048 bit numbers used in the cyclic group and the 256 bit numbers used in the elgamal example. Fortunately it is possible to use smaller cyclic groups and smaller curves to debug structural issues in the algorithms. For the cyclic groups this works like a charm, as the operations are pretty easy to do by hand or calculator, The operations for elliptic curve are a lot less intuitive, even for small curves. Especially with no known good library that implements these operations. This makes debugging some of the mix algorithm primitives tedious and time consuming.
+Debugging the \cmix operations is hard. Intermediate results produced by the nodes are hard to validate for the real 2048 bit numbers used in the cyclic group case and the 256 bit numbers used in the elliptic curve case. Fortunately it is possible to use smaller cyclic groups and smaller curves to debug structural issues in the algorithms. For the cyclic groups this works like a charm, as the operations are pretty easy to do by hand or with a calculator. The operations for elliptic curves are a lot less intuitive, even for small curves, especially with no known good library that implements these operations. This makes debugging some of the mix algorithm primitives tedious and time consuming.
 
-Some tools that have been a great help in creating the implementation are AddressSanitizer\cite{ASan} and the companion leak check tool LeakSanitizer\cite{LSan}. These tools check all the executed code paths for any object or array access outside of the allowed bounds. This is no guarantee the application is completely memory safe. It does ensure that any out of bounds access is detected, even if the access was still technically in valid memory just not part of the object or array. The leak tool also helped to keep an eye on the tokens passed around by the CMix library. Because the implementation language of the library was $C$ and there are 2 different implementations for the 2 different elgamal backends. There is a lot of token passing. These tokens need to be destructed but have no clear overarching owner. This makes it harder to keep track of object lifetime. So LeakSanitizer helped out tracking these lost tokens. Which allowed me to eventually eliminate memory leaks. Which in turn allowed me to run the benchmark for a high number of clients on a system with relatively low specs.
+Some tools that have been a great help in creating the implementation in general are AddressSanitizer\cite{ASan} and its companion leak checking tool LeakSanitizer\cite{LSan}. These tools check all the executed code paths for any object or array access outside of the allowed bounds. This is no guarantee that the application is completely memory safe, but it does ensure that any out of bounds access is detected, even if the access was still technically in valid memory, just not part of the object or array. The leak tool also helped to keep an eye on the tokens passed around by the \cmix library. Because the implementation language of the library is ``C'' and there are 2 different implementations for the 2 different ElGamal back-ends, there is a lot of token passing. These tokens need to be destructed but have no clear overarching owner, which makes it harder to keep track of object lifetime. LeakSanitizer helped to track down these lost tokens, which allowed me to eventually eliminate the memory leaks. That in turn allowed me to run the benchmark for a high number of clients on a system with relatively low specs.