CUDASW++ is a well-established, state-of-the-art bioinformatics tool for Smith-Waterman protein database searches. It takes advantage of the massively parallel CUDA architecture of NVIDIA GPUs to perform sequence searches 10x-50x faster than NCBI BLAST. The algorithm exploits the SIMT (Single Instruction, Multiple Thread) and virtualized SIMD (Single Instruction, Multiple Data) abstractions to achieve high speed. It has been fully tested on CUDA-enabled GPUs with compute capability 1.2 or higher, and has been incorporated into the NVIDIA Tesla Bio Workbench.



  1. Yongchao Liu, Douglas L. Maskell, Bertil Schmidt: "CUDASW++: optimizing Smith-Waterman sequence database searches for CUDA-enabled graphics processing units". BMC Research Notes, 2009, 2:73
  2. Yongchao Liu, Bertil Schmidt, Douglas L. Maskell: "CUDASW++2.0: enhanced Smith-Waterman protein database search on CUDA-enabled GPUs based on SIMT and virtualized SIMD abstractions". BMC Research Notes, 2010, 3:93
  3. Yongchao Liu, Adrianto Wirawan, Bertil Schmidt: "CUDASW++ 3.0: accelerating Smith-Waterman protein database search by coupling CPU and GPU SIMD instructions". BMC Bioinformatics, 2013, 14:117.

Other related papers

  1. Yongchao Liu, Bertil Schmidt, and Douglas L. Maskell: "MSA-CUDA: multiple sequence alignment on graphics processing units with CUDA". 20th IEEE International Conference on Application-specific Systems, Architectures and Processors (ASAP 2009), 2009, 121-128
  2. Yongchao Liu and Bertil Schmidt: "SWAPHI: Smith-Waterman protein database search on Xeon Phi coprocessors". 25th IEEE International Conference on Application-specific Systems, Architectures and Processors, 2014, pp. 184-185. [preprint at arXiv]
  3. Yongchao Liu, Tuan-Tu Tran, Felix Lauenroth, Bertil Schmidt: "SWAPHI-LS: Smith-Waterman algorithm on Xeon Phi coprocessors for long DNA sequences". 2014 IEEE International Conference on Cluster Computing, 2014, pp.257-265
  4. Yongchao Liu and Bertil Schmidt: "GSWABE: faster GPU-accelerated sequence alignment with optimal alignment retrieval for short DNA sequences". Concurrency and Computation: Practice and Experience, 2015, 27: 958-972
  5. Tuan Tu Tran, Yongchao Liu, Bertil Schmidt: "Bit-parallel approximate pattern matching: Kepler GPU versus Xeon Phi". Parallel Computing, 2016, 54: 128-138
  6. Haidong Lan, Weiguo Liu, Yongchao Liu and Bertil Schmidt: "SWhybrid: a hybrid parallel framework for large-scale protein sequence database search". 31st IEEE International Parallel and Distributed Processing Symposium (IPDPS 2017), 2017, in press.

Parameters of CUDASW++ 3.0





Parameters of CUDASW++ 2.0






Installation and Usage


Download and compiling

Typical Usage

Important notes

  1. query_file and database_file must be FASTA format files. Each file can store many query or subject sequences, respectively (recommended).
  2. For CUDASW++ 2.0, two models are supported: simt and simd. The simt model uses the optimized SIMT Smith-Waterman algorithm, which is independent of the scoring scheme used. The simd model uses the partitioned vectorized Smith-Waterman algorithm, whose performance is sensitive to the scoring scheme used.
  3. For CUDASW++ 3.0, users can use either a query profile or a query profile variant. When the L2 cache is larger (e.g. more than 512 KB), we recommend using the query profile variant. For shorter queries, the query profile variant is the better choice.
  4. Supported scoring matrix names: blosum45, blosum50, blosum62 and blosum80. If the scoring matrix is not specified or not supported, blosum62 is used by default.
  5. The default gap open penalty is 10 and gap extension penalty is 2.
  6. When using a single GPU (option -use_single), you can specify the index of the GPU used for the computation. The index starts from 0.
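
To make the notes on FASTA input and gap penalties more concrete, here is a minimal CPU sketch in Python. It is not the CUDASW++ implementation. It reads a multi-record FASTA stream and computes an optimal local alignment score with the standard affine-gap Smith-Waterman recurrence, using the default penalties stated above (gap open 10, gap extension 2). Several things are assumptions made for illustration: the names `read_fasta`, `toy_score` and `sw_score` are hypothetical, `toy_score` is a simple match/mismatch stand-in for a real BLOSUM matrix lookup, and the sketch assumes a gap of length k costs `gap_open + k * gap_extend`.

```python
from io import StringIO

GAP_OPEN = 10    # default gap open penalty (note 5)
GAP_EXTEND = 2   # default gap extension penalty (note 5)


def read_fasta(handle):
    """Yield (header, sequence) pairs from a multi-record FASTA stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def toy_score(a, b):
    """Hypothetical stand-in for a BLOSUM lookup: +5 match, -4 mismatch."""
    return 5 if a == b else -4


def sw_score(query, subject, score=toy_score,
             gap_open=GAP_OPEN, gap_extend=GAP_EXTEND):
    """Return the optimal local alignment score with affine gap penalties.

    Assumed convention: a gap of length k costs gap_open + k * gap_extend.
    """
    first = gap_open + gap_extend   # cost of the first residue in a gap
    n = len(subject)
    h_prev = [0] * (n + 1)          # H values of the previous row
    f = [0] * (n + 1)               # per column: best score ending in a vertical gap
    best = 0
    for i in range(1, len(query) + 1):
        h_cur = [0] * (n + 1)
        e = 0                       # best score ending in a horizontal gap in this row
        for j in range(1, n + 1):
            e = max(e - gap_extend, h_cur[j - 1] - first)
            f[j] = max(f[j] - gap_extend, h_prev[j] - first)
            diag = h_prev[j - 1] + score(query[i - 1], subject[j - 1])
            h_cur[j] = max(0, diag, e, f[j])   # local alignment: never below 0
            best = max(best, h_cur[j])
        h_prev = h_cur
    return best


if __name__ == "__main__":
    # Score every query against every subject, as a database search would.
    queries = StringIO(">q1\nACDEFG\n")
    database = StringIO(">s1\nACDEFG\n>s2\nWWWW\n")
    subjects = list(read_fasta(database))
    for q_name, q_seq in read_fasta(queries):
        for s_name, s_seq in subjects:
            print(q_name, s_name, sw_score(q_seq, s_seq))
```

The sketch keeps only one row of the dynamic programming matrix in memory, which is sufficient when only the score is needed and no traceback is required. Replacing `toy_score` with a lookup into a real substitution matrix such as blosum62 would bring the scores closer to what a protein search reports.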

Change Log


If you have any questions or suggestions for improvement, please feel free to contact Yongchao Liu.