**1.1 Problem Statement and Motivation **

Cryptography plays an important role in our daily life. Nowadays cryptography uses public key-encryption which involved a set of multiple-precision integer operations. A server such as SSL server that relies on public-key encryption needs to compute a large number of multiple-precision integer operation requires large computing power (Kaiyong Zhao & Xiaowen Chu 2010). Modern PCs are very good at computing numbers whose length does not exceed 32 bits or 64 bits. However, when numbers exceeded the limit, computer arithmetic will fail. Different constraints imposed by the underlying hardware architecture and programing language cause the failure (Youssef Bassil & Aziz Barbar 2004). Therefore, different algorithms and technique were developed to solve the problem of arithmetic computation on big number on the CPU as well as GPU.

Recent improvement in Graphics Processing Units (GPU) ushered a new era of GPU computing (Owens, J. D., Houston, M., Luebke, D., Green, S., Stone, J. E & Phillips, J. C 2008). For example, commodity GPUs like NVIDIAβs GTX 1070 has 1920 processing cores and can achieve 5783 GFLOPS of computational horsepower.

We are inspired by the point that end users and application servers can employ GPUs to increase the computation speed of multiple-precision integer operations. However, there is difficulty to achieve high performance on GPUs due to the complicated memory architecture and the relatively slow integer operations. (Kaiyong Zhao & Xiaowen Chu 2010)

GPU is a type of processor that is responsible to compute computer graphic. GPUs were originally created for a high-performance workstation and therefore costly. In early 1990, 3D games with rendering processor were appearing since the 3D accelerator hardware was made. API OpenGL was basically from a graphic application of professional workstation and adopted to make a graphic 3D game programming, the same as the emergence of the DirectX and Direct3D (Khoirudin, & Shun-Liang, J 2015).

BCS (Hons) Computer Science

Faculty of Information and Communication Technology (Perak Campus), UTAR 2

In addition, GPU became more affordable and more powerful. Recently, due to the promotion of gaming on PC, the development of GPU overtook the CPU development.

The rapid transformation on GPU allows improvement in parallel using the multiple cores. This is more effective when the programmer wants to process a lot of vertices or fragments in the similar way. With multi-core that control by very high memory bandwidth, GPU serves with incredible resources for both non-graphics processing and graphics processing at the same time.

On a modern GPU, shader number or often called the Stream Processor (for stream input and output/Stream), has reached the hundreds or even thousands. GPU calculation abilities can reach Terra FLOPS, which hundreds of times faster than the CPU.

Therefore, other non-graphic calculation can be performed in GPU. The comparison between GPU and the CPU is shown in Table 1 by (Khoirudin & Shun-Liang, J 2015).

CPU GPU

Parallelism through time multiplexing Parallelism through space multiplexing

Emphasis on low memory latency Emphasis on high memory throughput

Allows wide range of control flows + control flow optimisation

Very control flow restricted Optimised for low latency access to

caches data set

Optimised for data parallel, throughput computation Very high clock speed Mid-tempo clock speed

Peak computation capability low Higher peak computation capability Off-chip bandwidth lower Higher off-chip bandwidth

Handle sequential code well Requires massively parallel computing

Great for task parallelism Great for data parallelism

*Table 1. Comparison between CPU and GPU *

As mentioned above, modern field of cryptography can be divided into Symmetric-key cryptography and Public-key cryptography. One of the practical public-key encryption system is the RSA algorithm.

RSA is one of the first practical public-key cryptosystems and is widely used for secure data transmission. Basically, RSA produces public key based on two large prime

BCS (Hons) Computer Science

Faculty of Information and Communication Technology (Perak Campus), UTAR 3

numbers along with an auxiliary value. The RSA algorithm includes modular arithmetic as well as Montgomery modular exponentiation on the large numbers which have the key size of 1024 to 4096 bit typically. Therefore large integer arithmetic is needed in cryptography.

RSA involves a public key and a private key. The public key is distributed to the public
for message encryption, and only can be decrypted by the private key. In practical, three
very large positive integers *e, d and n such that with modular exponentiation for all *
message m:

(π^{π})^{π} β‘ π (πππ π)

To encrypt the message m such that 0 β€ π < π, we computes the ciphertext c by using the public key e, corresponding to

π β‘ π^{π} (πππ π)

To recover the message, we can decrypt the encrypted message c using private key d by computing

π^{π} β‘ (π^{π})^{π} β‘ π (πππ π)

As we can see, the RSA encryption scheme requires heavy load of computation of modular exponentiation that are computational expensive on CPU. Therefore, we implemented Montgomery Multiplication and exponentiation on GPU to compute multiple messages in parallel to save time.

## RSA

Key generation

Key distribution

Encryption Decryption

BCS (Hons) Computer Science

Faculty of Information and Communication Technology (Perak Campus), UTAR 4

Besides, another public-key cryptosystem is Diffie-Hellman Key Exchange. It was one of the first public-key protocols as originally conceptualised by Ralph Merkle and named after Whitfield Diffie and Martin Hellman by ( Diffie, W. & Hellman, M. 1976).

Diffie-Hellman Key Exchange involved arithmetic exponential to compute the key which should be at least 2048 bits. Elliptic curve cryptography (ECC) is another approach to public-key cryptography based on the algebraic structure of elliptic curves over finite fields also implement Montgomery modular exponentiation as well.

Most algorithms in cryptography require expensive computation to perform encryption and decryption multiple times. Therefore, a GPU large integer arithmetic library is needed to support the operations.

**1.2 Project Scope **

The scope of project is to develop a library for multiple-precision integer arithmetic for GPU which provide the operation for non-negative addition, subtraction, multiplication, division, Montgomery exponentiation and multiplication in parallel technique to optimise the usage of GPU. The performance of the library will then be analysed against GMP library for CPU to compare the computational time between GPU and CPU. At last, an optimised implementation of public key cryptography algorithm will be designed based on the developed library.

**1.3 Project Objective **

Although there is existing libraries supporting the GPU, they are not open source.

Besides, the libraries are designed to support only either host side function call or device side function call. Host side function call is when the CPU calls the function and it is passed to GPU to compute and then the result is returned to the CPU whereas device side function call is when the GPU calls the function in GPU, which is good for developing algorithms that use multiple large integer arithmetic operations. The diagram below illustrates the communication between host side and device side, from (Nigerianewsday.com 2016).

BCS (Hons) Computer Science

Faculty of Information and Communication Technology (Perak Campus), UTAR 5
*Figure 1: Host side and device side*

As the libraries implements such principle, there is overheads from the communication between CPU and GPU which will slow down the computation speed. Therefore, a further improvement will be added in this project to overcome this weakness.

**1.4 Impact, significance and contribution **

By developing the open source library, computation time can be reduced in large number arithmetic by using GPUs, which will be useful and convenient in the field of cryptography. The challenges we need to overcome which are to override the need of overheads from communication between CPU and GPU, algorithms and code optimisation to achieve the speed. Besides, a good manual memory management for dynamic memory allocation is needed in the C programming language via a group of functions in the C standard library, namely malloc and free.

**1.5 Background Information **

In this library, large integer is stored and represented in a system of radix 2^{32}. In short,
the large integer is represented in an array of unsigned integer, d0, d1, d2 β¦ dn as d0b^{n} +
d1b^{n-1 }+ β¦ + dnb^{0}, where 0 β€ di < 2^{32}. In such way, memory optimisation can be achieved
as we can fully optimise 4 bytes of memory given in each array index.

Therefore, different parallel algorithms can be applied to perform arithmetic operations on GPU. This project will support arithmetic operations such as (modular) addition, subtraction, multiplication, division and exponentiation, Montgomery multiplication and exponentiation. By having the library equips with arithmetic operations, it enables user to perform large integer calculation for the purpose of cryptography in a fast manner.

BCS (Hons) Computer Science

Faculty of Information and Communication Technology (Perak Campus), UTAR 6