Introduction
CUDASW++ is a well-established, state-of-the-art bioinformatics software package for Smith-Waterman protein database searches. It takes advantage of the massively parallel CUDA architecture of NVIDIA GPUs to perform sequence searches 10x-50x faster than NCBI BLAST. The algorithm makes extensive use of the SIMT (Single Instruction, Multiple Thread) and virtualized SIMD (Single Instruction, Multiple Data) abstractions to achieve high speed. It has been fully tested on CUDA-enabled GPUs with compute capability 1.2 or higher, and has been incorporated into the NVIDIA Tesla Bio Workbench.
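For readers new to the method, the following is a minimal CPU reference sketch of the Smith-Waterman local alignment score with affine gap penalties. It is not the CUDASW++ implementation. It assumes that a gap of length k costs gap_open + k * gap_extend (the convention is not stated on this page), and it uses a hypothetical toy scoring function (toy_score) in place of a BLOSUM matrix.

```python
def toy_score(a, b):
    """Hypothetical scoring function for illustration: +5 match, -4 mismatch."""
    return 5 if a == b else -4


def sw_score(query, subject, score=toy_score, gap_open=10, gap_extend=2):
    """Return the optimal local alignment score with affine gaps.

    Assumption: a gap of length k costs gap_open + k * gap_extend.
    """
    n = len(subject)
    h_prev = [0] * (n + 1)  # H values of the previous row
    f = [0] * (n + 1)       # F: best score ending in a vertical gap, per column
    best = 0
    for qc in query:
        h_cur = [0] * (n + 1)
        e = 0               # E: best score ending in a horizontal gap
        for j in range(1, n + 1):
            e = max(e - gap_extend, h_cur[j - 1] - gap_open - gap_extend)
            f[j] = max(f[j] - gap_extend, h_prev[j] - gap_open - gap_extend)
            h = max(0, e, f[j], h_prev[j - 1] + score(qc, subject[j - 1]))
            h_cur[j] = h
            best = max(best, h)
        h_prev = h_cur
    return best


if __name__ == "__main__":
    print(sw_score("ACGT", "ACGT"))             # four matches: 20
    print(sw_score("AAAATTTT", "AAAACCCTTTT"))  # eight matches, one gap of length 3: 24
```

The sketch keeps only two rows of the dynamic programming matrix because a database search needs the score, not the alignment itself.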
Downloads
- CUDASW++ 3.0 (v3.1.1)
This distribution is designed for CUDA-enabled GPUs based on the Kepler architecture (see the paper for more implementation details).
- Add a new option "-qprf" to allow users to specify whether to use the query profile or the query profile variant. By default, the query profile is used. For short queries (e.g. shorter than 400 residues), we recommend using the query profile variant; otherwise, use the query profile.
- Add a compile-time macro "RT_DEBUG" to enable reporting of the runtimes and GCUPS of both the CPU and GPU SIMD computation parts at Stage (ii). It can be enabled by modifying the Makefile.
- Add a compile-time macro "DISABLE_CPU_THREADS" to allow users to disable CPU threads for evaluation purposes. In this case, only GPUs are used for the whole computation. It can be enabled by modifying the Makefile.
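GCUPS (billion cell updates per second) is commonly computed as the number of dynamic programming cells divided by the runtime. The sketch below assumes that common definition (query length times total database residues, divided by seconds times 10^9). The function name and the example numbers are illustrative and are not taken from the CUDASW++ source.

```python
def gcups(query_length, db_residues, seconds):
    """Billion cell updates per second, assuming cells = query length x database residues."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return (query_length * db_residues) / (seconds * 1e9)


if __name__ == "__main__":
    # Illustrative numbers only: a 1000-residue query against 2e9 residues in 20 s.
    print(gcups(1000, 2_000_000_000, 20.0))
```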
We welcome bug reports and optimization suggestions!
- CUDASW++ 2.0 (v2.0.11)
- Protein sequences
- Click here to learn how to customize a scoring matrix
Citation
- Yongchao Liu, Douglas L. Maskell, Bertil Schmidt: "CUDASW++: optimizing Smith-Waterman sequence database searches for CUDA-enabled graphics processing units". BMC Research Notes, 2009, 2:73
- Yongchao Liu, Bertil Schmidt, Douglas L. Maskell: "CUDASW++2.0: enhanced Smith-Waterman protein database search on CUDA-enabled GPUs based on SIMT and virtualized SIMD abstractions". BMC Research Notes, 2010, 3:93
- Yongchao Liu, Adrianto Wirawan, Bertil Schmidt: "CUDASW++ 3.0: accelerating Smith-Waterman protein database search by coupling CPU and GPU SIMD instructions". BMC Bioinformatics, 2013, 14:117.
Other related papers
- Yongchao Liu, Bertil Schmidt, and Douglas L. Maskell: "MSA-CUDA: multiple sequence alignment on graphics processing units with CUDA". 20th IEEE International Conference on Application-specific Systems, Architectures and Processors (ASAP 2009), 2009, 121-128
- Yongchao Liu and Bertil Schmidt: "SWAPHI: Smith-Waterman protein database search on Xeon Phi coprocessors". 25th IEEE International Conference on Application-specific Systems, Architectures and Processors, 2014, pp. 184-185. [preprint at arXiv]
- Yongchao Liu, Tuan-Tu Tran, Felix Lauenroth, Bertil Schmidt: "SWAPHI-LS: Smith-Waterman algorithm on Xeon Phi coprocessors for long DNA sequences". 2014 IEEE International Conference on Cluster Computing, 2014, pp.257-265
- Yongchao Liu and Bertil Schmidt: "GSWABE: faster GPU-accelerated sequence alignment with optimal alignment retrieval for short DNA sequences". Concurrency and Computation: Practice and Experience, 2015, 27: 958-972
- Tuan Tu Tran, Yongchao Liu, Bertil Schmidt: "Bit-parallel approximate pattern matching: Kepler GPU versus Xeon Phi". Parallel Computing, 2016, 54: 128-138
- Haidong Lan, Weiguo Liu, Yongchao Liu and Bertil Schmidt: "SWhybrid: a hybrid parallel framework for large-scale protein sequence database search", 31st IEEE International Parallel and Distributed Processing Symposium (IPDPS 2017), 2017, pp. 42-51.
Parameters of CUDASW++ 3.0
Input:
- -query <string>: specify the query sequence file
- -db <string> : specify the database sequence file
Scoring:
- -mat <string>: specify the substitution matrix name (default blosum62)
supported matrix names: blosum45, blosum50, blosum62 and blosum80
- -gapo <integer>: specify the gap open penalty (0 ~ 255) (default 10)
- -gape <integer>: specify the gap extension penalty (0 ~ 255) (default 2)
- -min_score <integer>: specify the minimum score reported (default 100)
- -topscore_num <integer>: specify the number of top scores reported (default 10)
Compute:
- -qprf <int>: 1 to use the query profile, 0 to use its variant (default 1)
- -use_single <integer>: force the use of a single GPU with ID #integer
- -num_gpus <int>: number of GPUs (default 1)
- -num_threads <int>: number of CPU threads (forced to be >= #num_gpus)
Others:
- -version: print out the version
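As background for the -qprf option, a query profile is commonly defined as a table of substitution scores precomputed for every alphabet residue against every query position, so that the inner alignment loop performs a table lookup instead of a matrix lookup per cell. The sketch below shows that general idea only. The toy scoring function and the two-letter alphabet are hypothetical, and the sketch does not describe how CUDASW++ lays out the profile in GPU memory or what the query profile variant changes.

```python
def toy_score(a, b):
    """Hypothetical scoring function for illustration: +5 match, -4 mismatch."""
    return 5 if a == b else -4


def build_query_profile(query, alphabet, score=toy_score):
    """Return {residue: [score(residue, q) for each query position q]}."""
    return {r: [score(r, q) for q in query] for r in alphabet}


if __name__ == "__main__":
    profile = build_query_profile("ACCA", "AC")
    # Score of subject residue "A" against query position 1 (which is "C").
    print(profile["A"][1])
```

With the profile built once per query, the score for subject residue r at query position i is profile[r][i] for every database sequence.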
Parameters of CUDASW++ 2.0
Mode:
- -mod <string>: specify the programming model used (default simt)
supported model names: simt and simd
Input:
- -query <string>: specify the query sequence file
- -db <string> : specify the database sequence file
Scoring:
- -mat <string>: specify the substitution matrix name (default blosum62)
supported matrix names: blosum45, blosum50, blosum62 and blosum80
- -gapo <integer>: specify the gap open penalty (0 ~ 255) (default 10)
- -gape <integer>: specify the gap extension penalty (0 ~ 255) (default 2)
- -min_score <integer>: specify the minimum score reported (default 100)
- -topscore_num <integer>: specify the number of top scores reported (default 10)
Compute:
- -use_single <integer>: force the use of a single GPU with ID #integer
Others:
- -version: print out the version
Installation and Usage
Preparation
- CUDASW++ 2.0
- single or multiple CUDA-enabled GPUs with compute capability 1.2 or higher.
- CUDA toolkit 2.0 or higher. Versions before 2.0.11 also require the CUDA SDK, which is independent of the toolkit and can be downloaded from here. From CUDASW++ 2.0.11 onwards, the dependency on the CUDA SDK has been removed and users no longer need to install it.
- CUDASW++ 3.0
- single or multiple CUDA-enabled GPUs with compute capability 3.0 or higher.
- CUDA toolkit 4.2 or higher. Versions before 3.1.1 also require the CUDA SDK, which is independent of the toolkit and can be downloaded from here. From CUDASW++ 3.1.1 onwards, the dependency on the CUDA SDK has been removed and users no longer need to install it.
Download and compiling
- download and unzip the source code
- run "make" to compile the source code; this generates an executable binary named "cudasw".
- Check the compute capability of your GPU device before compiling. For Kepler-based GPUs, specify "-arch sm_35" for compute capability 3.5 or "-arch sm_30" for 3.0; for Fermi-based GPUs with compute capability 2.0, use "-arch sm_20"; for compute capability 1.3, use "-arch sm_13"; and for 1.2, use "-arch sm_12".
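The mapping from compute capability to compiler flag listed above can be written as a small lookup. This is an illustrative helper, not part of the CUDASW++ build system. The function name is hypothetical, and only the capabilities named on this page are included.

```python
# Compute capability (major, minor) -> nvcc architecture flag, as listed above.
ARCH_FLAGS = {
    (3, 5): "-arch sm_35",
    (3, 0): "-arch sm_30",
    (2, 0): "-arch sm_20",
    (1, 3): "-arch sm_13",
    (1, 2): "-arch sm_12",
}


def arch_flag(major, minor):
    """Return the -arch flag for a compute capability listed on this page."""
    try:
        return ARCH_FLAGS[(major, minor)]
    except KeyError:
        raise ValueError(
            f"compute capability {major}.{minor} is not listed on this page"
        ) from None


if __name__ == "__main__":
    print(arch_flag(3, 5))
```

Capabilities that are not listed raise an error instead of guessing a flag, since the page does not say which flag to use for them.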
Typical Usage
- CUDASW++ 2.0
- Run ./cudasw, ./cudasw -? or ./cudasw -help to list all supported parameters.
- ./cudasw -mod simt -query query_file -db db_file -mat blosum62
- ./cudasw -mod simd -query query_file -db db_file -mat blosum45
- ./cudasw -query query_file -db db_file -mat blosum62 -gapo 20 -gape 2
- ./cudasw -mod simd -query query_file -db db_file -gapo 20 -gape 2
- ./cudasw -query query_file -db db_file -use_single 0
- ./cudasw -query query_file -db db_file -use_single 1
- CUDASW++ 3.0
- Run ./cudasw, ./cudasw -? or ./cudasw -help to list all supported parameters.
- ./cudasw -qprf 0 -query query_file -db db_file -mat blosum62
- ./cudasw -qprf 1 -query query_file -db db_file -num_threads 8 -num_gpus 2
- ./cudasw -qprf 0 -query query_file -db db_file -use_single 1
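When CUDASW++ 3.0 is driven from a script, invocations like the ones above can be assembled and checked before launch. The following is a sketch of a hypothetical helper (build_cudasw3_command is not part of CUDASW++) that uses only the options documented on this page. Its validation rules mirror the documented ranges. Emitting -use_single in place of -num_gpus, and rejecting a thread count below the GPU count instead of adjusting it, are choices made for this sketch.

```python
SUPPORTED_MATRICES = ("blosum45", "blosum50", "blosum62", "blosum80")


def build_cudasw3_command(query_file, db_file, mat="blosum62", gapo=10, gape=2,
                          qprf=1, num_gpus=1, num_threads=None,
                          use_single=None, binary="./cudasw"):
    """Assemble an argument list for CUDASW++ 3.0 (hypothetical helper)."""
    if mat not in SUPPORTED_MATRICES:
        raise ValueError(f"unsupported matrix: {mat}")
    for name, value in (("gapo", gapo), ("gape", gape)):
        if not 0 <= value <= 255:
            raise ValueError(f"{name} must be in 0..255, got {value}")
    if qprf not in (0, 1):
        raise ValueError("qprf must be 0 (variant) or 1 (query profile)")
    args = [binary, "-qprf", str(qprf), "-query", query_file, "-db", db_file,
            "-mat", mat, "-gapo", str(gapo), "-gape", str(gape)]
    if use_single is not None:
        # Sketch choice: pin the run to one GPU and leave out -num_gpus.
        args += ["-use_single", str(use_single)]
    else:
        args += ["-num_gpus", str(num_gpus)]
    if num_threads is not None:
        # Sketch choice: reject the value rather than silently raising it.
        if num_threads < num_gpus:
            raise ValueError("num_threads must be >= num_gpus")
        args += ["-num_threads", str(num_threads)]
    return args


if __name__ == "__main__":
    cmd = build_cudasw3_command("query_file", "db_file", num_threads=8, num_gpus=2)
    print(" ".join(cmd))
```

The returned list can be passed to subprocess.run(cmd, check=True). Because the program returns 0 on success and -1 otherwise since v3.0.15, check=True raises an exception when a run fails.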
Important notes
- The query file and the database file must be in FASTA format. Multiple query sequences can be stored in a single query file, and multiple subject sequences in a single database file (recommended).
- For CUDASW++ 2.0, two models are supported: simt and simd. The simt model uses the optimized SIMT Smith-Waterman algorithm, which is independent of the scoring scheme used. The simd model uses the partitioned vectorized Smith-Waterman algorithm, whose performance is somewhat sensitive to the scoring scheme used.
- For CUDASW++ 3.0, users can use either a query profile or a query profile variant. When the L2 cache is large (e.g. more than 512 KB), we recommend using the query profile variant. The query profile variant is also the better choice for shorter queries.
- Supported scoring matrix names: blosum45, blosum50, blosum62 and blosum80. If the scoring matrix is not specified or not supported, blosum62 is used by default.
- The default gap open penalty is 10 and gap extension penalty is 2.
- When using a single GPU (option -use_single), you can specify the index of the GPU used for the computation. The index starts from 0.
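To illustrate the multi-sequence FASTA layout mentioned in the notes above, here is a minimal reader sketch. It assumes the common FASTA convention of a header line starting with ">" followed by one or more sequence lines. The function name and the sample records are hypothetical, and this is not the parser used by CUDASW++.

```python
def read_fasta(lines):
    """Parse FASTA text lines into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # a sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


if __name__ == "__main__":
    sample = [
        ">query1 hypothetical example",
        "MKTAYIAK",
        "QRQISFVK",
        ">query2 hypothetical example",
        "GSHMLE",
    ]
    for name, seq in read_fasta(sample):
        print(name, len(seq))
```

Passing an open file object works as well, since iterating over it yields lines.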
Change Log
- March 25, 2015 (v2.0.11 and v3.1.1)
- Removed the dependency on the CUDA SDK; users no longer need to install the NVIDIA CUDA SDK.
- March 18, 2014 (v3.1)
- Fixed a bug in the generation of the query profile variant. We thank Haidong Lan from Shandong University, China for his bug report.
- March 04, 2014 (v3.0.15)
- Changed the return code of the main function so that it returns 0 on success and -1 otherwise.
- December 04, 2012 (v2.0.10)
- In version 2.0.9, the packing format was not updated for the multi-GPU version. This version fixes that problem.
- September 03, 2012 (v2.0.9)
- Changed the packing format from uchar4 to unsigned int for the subject sequences in the inter-task parallelization.
Contact
If you have any questions or suggestions for improvement, please feel free to contact Liu, Yongchao.