Fast Two Stage P-value Computation Software

Introduction


FastPval is a two stage p-value computation software which compute the empirical p-value by two stage ranking strategy, it can produce very low P-value based on huge dataset. This fast and powerful tool takes advantage of a delicate cutoff which separate the exactly significant area. Compared to the traditional ranking method, Tspvc has a good time efficiency, lower memory consuming and tiny storage spaces with high accuracy.

Traditionally, exact P-value computation based on permutation is needed for sorting the dataset before ranking. For huge dataset(more than 1 billion values), the time complexity of sorting time will be long(nlogn). Furthermore, surprisingly, the space complexity will be very large(1 billion dataset will take ~4G memory).

We adopt this two stage strategy in order to reduce the computation time and overcome memory restriction using the exact P-value computation method especially in the huge dataset. At first stage, we make a random sampling array (which means the first model) with given sampling number (up to the volume of dataset) and sort it. Then we select the satisfied array (which means the second model) from the dataset with the cutoff number which fetched form the first sampling array. So, directly, we can compute the approximate P-value with those two model.

The core procedure is implemented by C. We also offer a Java-based GUI by encapsulating the C package.

Features


Data Format


The dataset and test sample should follow the format:    dataset example    test sample

1.323204
2.323434
34.767567
0.565465
-2.432545
5.675477
No space, new line, tab, etc. before the first data, each data should follow a '\n' at the end of line.

Usage


There are three ways to launch the software:
   Please install the latest sun JRE in your local machine firstly.
1. Running from Java Web Start by clicking Launch the application 32bit / 64bit .

2. You can download 32bit / 64bit the package and run using following commands:

java -cp [the path of pvalue.jar] pvalue.MainGUI

3. Run the procedure by command line:
  1). Quick start:
java -cp <pvalue jar file path> pvalue.Pvalue -m 0 -d <dataset> -s <test sample>
-o <output path>
  2). Global parameters:
Required parameters:
   -m <Integer> The computation method:
      0: the two stage method
      1: the exact method(brute ranking)
      
Optional parameters:(default in [ ]): 
   -i <Integer> The instruction of current method [0]
      This parameter is effective only when '-m' is 0
      0: the general two stage method
      1: generate the model
      2: compute from given model
  3). The general two stage method (when '-i' is 0):
Required parameters:
   -d <String> The dataset file path
   -s <String> The test sample file path
   -o <String> The output directory
   
Optional parameters:(default in [ ]):
   -n <Integer> The sampling number for first model [100000]
      it is recommended (100000--<100M; 1000000--<10B;)
   -c <Integer> The P-value for the cutoff of first distribution [0.001]
   -a <Integer> The assumed distribution of dataset [0]
      0: normal distribution
      1: extreme distribution
  4). Generate model (when '-i' is 1):
Required parameters:
   -d <String> The dataset file path
   -o <String> The output directory
   
Optional parameters:(default in [ ]):
   -n <Integer> The sampling number for first model [100000]
      it is recommended (100000--<100M; 1000000--<10B;)
   -c <Integer> The P-value for the cutoff of first distribution [0.001]
  5). Compute P-value (when '-i' is 2):
Required parameters:
   -b1 <String> The first model file path
   -b2 <String> The second model file path
   -s <String> The test sample file path
   -o <String> The output directory
   
Optional parameters:(default in [ ]):
   -a <Integer> The assumed distribution of dataset [0]
      0: normal distribution
      1: extreme distribution
  6). The exact method (brute ranking) (when '-m' is 1):
Required parameters:
   -d <String> The dataset file path
   -s <String> The test sample file path
   -o <String> The output directory

4. Source code download .

Evaluation


1). Complexity
Procedure has been tested and evaluated in our server (Intel Xeon CPU E5410 2.33GHz; 16G Memory; -n:100000; -c:0.001)
Dataset Size FastPval(Time) Exact Computation(Time) FastPval(RAM) Exact Computation(RAM) FastPval Model Size
1M/12.1MB 1.05s+0.08s+1.53s 1.10s+0.74s+2.33s 0.39MB 4MB 1.3MB+13KB
10M/121.4MB 9.29s+0.09s+16.07s 11.21s+7.61s+29.88s 0.39MB 38MB 1.3MB+131KB
100M/1.2GB 91.46s+0.14s+249.44s 116.73s+77.13s+332.13s 0.78MB 373MB 1.3MB+1.3MB
500M/5.9GB 455.23s+0.40s+1297.44s 677.58s+380.47s+1885.12s 2MB 1.9GB 1.3MB+6MB
1B/11.9GB 919.45s+0.72s+2530.65s 1409.61s+761.52s+4019.77s 4MB 3.7GB 1.3MB+11.9MB
5B/59.9GB 5475.32s+3.34s+12885.87s N/A 19MB N/A(>18.5GB) 1.3MB+59.8MB

Table: The comparison between FastPval and Exact P-value Computation(brutal sorting). Dateset Size: The number of dataset(M: million; B: billion) and the file size.  FastPval(Time): The model generating time and model loading time and P-value computation time using current dataset as test sample.  Exact Computation(Time): The time of sorting dataset and background loading time and exact P-value computation time by binary search using current dataset as test sample.  FastPval(RAM): The total memory consumption of FastPval.  Exact Computation(RAM) : The total memory consumption using exact P-value computation.  FastPval Model Size : The two model files size of FastPval.

Generally, the time complexity of the two stage P-value Computation is 2N and the time complexity of the Exact P-value Computation is NlogN. The space complexity of two stage P-value Computation is changed by sampling number and cutoff number, but it will be very small.

2). Accuracy
Compared to the Exact computation, we test the accuracy of the two stage P-value Computation using Q-Q plot.
normal 1 billion poission 1 billion gumbel 1 billion

Development


We will continue to develop next version with advanced functions and good usability.

Acknowledgments


The research was supported by internal funds from CRCG, CRDG and genomic SRT of HKU, GRF 778609M and AoE M-04/04 from RGC of Hong Kong.

Citation


FastPval: a fast and memory efficient program to calculate very low P-values from empirical distribution.
Mulin Jun Li, Pak Chung Sham, and Junwen Wang Bioinformatics. 2010 Nov 15;26(22):2897-9. Epub 2010 Sep 21.