

Molecular and Genetic Medicine

ISSN: 1747-0862

Open Access

Research Article - (2023) Volume 17, Issue 4

Local Optimization for Chromosome-Level Assembly (LOCLA)

Wei-Hsuan Chuang1*, Hsueh-Chien Cheng1, Pao-Yin Fu1, Yi-Chen Huang1, Ping-Heng Hsieh1, Shu-Hwa Chen2, Pui-Yan Kwok3, Chung-Yen Lin1, Jan-Ming Ho1 and Yu-Jung Chang4
*Correspondence: Wei-Hsuan Chuang, Institute of Information Science, Academia Sinica, Taipei, Taiwan, Tel: 886975218950, Email:
1Institute of Information Science, Academia Sinica, Taipei, Taiwan
2TMU Research Center of Cancer Translational Medicine, Taipei Medical University, Taipei, Taiwan
3Institute of Biomedical Sciences, Academia Sinica, Taipei, Taiwan
4Ocean Data Bank, Institute of Oceanography, National Taiwan University, Taipei, Taiwan

Received: 01-Jul-2023, Manuscript No. jmgm-23-104990; Editor assigned: 02-Jul-2023, Pre QC No. P-104990; Reviewed: 18-Jul-2023, QC No. Q-104990; Revised: 24-Jul-2023, Manuscript No. R-104990; Published: 31-Jul-2023, DOI: 10.37421/1747-0862.2023.17.613
Citation: Chuang, Wei-Hsuan, Hsueh-Chien Cheng, Yu-Jung Chang and Pao-Yin Fu, et al. “Local Optimization for Chromosome-Level Assembly (LOCLA).” J Mol Genet Med 17 (2023): 613.
Copyright: © 2023 Chuang WH, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Abstract

In this paper, we introduce a novel genome assembly optimization tool named LOCLA. It identifies reads that align locally with high quality to gap flanks or scaffold boundaries and assembles them into contigs for gap filling or scaffold connection. LOCLA improves the quality of an assembly using reads from different sequencing technologies: 10x Genomics (10xG) Linked-Reads, PacBio HiFi reads, or both. For example, with 10xG Linked-Reads, the long-range information provided by barcodes allows LOCLA to recruit additional reads belonging to the same gDNA molecule, resulting in accurate gap filling and increased sequence coverage.
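The barcode-based recruitment described above can be illustrated with a minimal Python sketch. It is not LOCLA's implementation: the `Read` record, the `recruit_reads` function, and the `min_mapq` threshold are hypothetical names and values chosen for illustration, and real input would come from an alignment file rather than an in-memory list.

```python
from typing import List, NamedTuple, Optional


class Read(NamedTuple):
    """Toy alignment record (hypothetical; not LOCLA's data model)."""
    name: str
    barcode: str             # 10xG barcode attached to the read
    scaffold: Optional[str]  # scaffold the read aligns to, None if unaligned
    pos: int                 # alignment start position on the scaffold
    mapq: int                # mapping quality of the alignment


def recruit_reads(reads: List[Read], scaffold: str,
                  flank_start: int, flank_end: int,
                  min_mapq: int = 30) -> List[str]:
    """Return names of reads to assemble for one gap flank.

    Step 1: pick seed reads aligned with high quality inside the flank.
    Step 2: recruit every read that shares a barcode with a seed, since
            reads with the same barcode likely come from the same molecule.
    """
    seed_barcodes = {
        r.barcode
        for r in reads
        if r.scaffold == scaffold
        and flank_start <= r.pos < flank_end
        and r.mapq >= min_mapq  # assumed quality cutoff
    }
    return [r.name for r in reads if r.barcode in seed_barcodes]


# Toy data: r2 is unaligned but shares barcode BC1 with the seed read r1.
toy_reads = [
    Read("r1", "BC1", "scaf1", 100, 60),
    Read("r2", "BC1", None, 0, 0),
    Read("r3", "BC2", "scaf1", 5000, 60),
    Read("r4", "BC3", "scaf1", 150, 5),
]
recruited = recruit_reads(toy_reads, "scaf1", 0, 1000)
```

In this toy example the unaligned read `r2` is recruited only because of its barcode, which is the kind of read that alignment alone would miss.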

In our experiments, we started by creating a preliminary draft assembly for each dataset using an assembler suited to the type of sequencing reads, such as Supernova or Canu. The preliminary draft assembly could be either a de novo assembly or a reference-based assembly. We then ran LOCLA on the assembly, generally performing gap filling first and scaffolding second. We validated LOCLA on four datasets, including three human samples and one non-model organism. For the first human sample (LLD0021C) and the non-model organism (B. sexangula), draft assemblies were generated with the Supernova assembler using only 10xG Linked-Reads. We showed that LOCLA improved the draft assembly of LLD0021C by adding 23.3 million bases, which covered 28,746 protein coding regions, particularly in pericentromeric and telomeric regions. As for B. sexangula, LOCLA enhanced the assembly published by Pootakham W, et al. by reducing its gaps by 41.4%.
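As a rough illustration of how a gap reduction figure like the one above could be computed, the following Python sketch counts gaps as runs of N in scaffold sequences before and after gap filling. This is an assumption for illustration only: the text does not state whether the percentage refers to gap count or gap length, and the function names `count_gaps` and `gap_reduction_percent` are hypothetical.

```python
import re
from typing import Iterable


def count_gaps(seq: str) -> int:
    """Count gaps in one scaffold, where a gap is a maximal run of N/n."""
    return len(re.findall(r"[Nn]+", seq))


def gap_reduction_percent(before: Iterable[str], after: Iterable[str]) -> float:
    """Percentage of gaps removed between two versions of an assembly."""
    gaps_before = sum(count_gaps(s) for s in before)
    gaps_after = sum(count_gaps(s) for s in after)
    if gaps_before == 0:
        raise ValueError("the original assembly contains no gaps")
    return 100.0 * (gaps_before - gaps_after) / gaps_before


# Toy scaffolds: four gaps before filling, one gap left afterwards.
draft = ["ACGTNNNNACGTNNACGT", "TTNNNNNNGGNNCC"]
filled = ["ACGTACGAACGTGGACGT", "TTACGTAAGGNNCC"]
reduction = gap_reduction_percent(draft, filled)
```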

For the second human sample, the HG002 (NA24385) cell line, we mainly utilized PacBio HiFi reads. In contrast to the first human sample, we experimented on reference-based assemblies instead of de novo assemblies. We employed the RagTag reference-guided scaffolding tool to generate two draft assemblies and then filled gaps with LOCLA. The results indicated that LOCLA's candidate contig detection algorithm on gap flanks was robust, as it recovered a number of contigs that RagTag had not utilized. These contigs amounted to 27.9 million bases (22.26%) and 35.7 million bases (30.93%) for the two assemblies, respectively. To evaluate the accuracy of the LOCLA-filled assemblies, we aligned them to the maternal haploid assembly of HG002 published by the Human Pangenome Reference Consortium. We demonstrated that 95% of all sequences filled in by LOCLA have over 80% similarity to the reference.
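The accuracy summary above (the share of filled sequences exceeding a similarity threshold) can be sketched in Python as follows. The definition of similarity used here, matching bases divided by filled length, is an assumption; the text does not specify how similarity was measured, and the names `similarity` and `fraction_above` are hypothetical.

```python
from typing import Iterable, Tuple


def similarity(matches: int, filled_length: int) -> float:
    """Assumed similarity: matching bases over the length of the filled sequence."""
    if filled_length <= 0:
        raise ValueError("filled_length must be positive")
    return matches / filled_length


def fraction_above(fills: Iterable[Tuple[int, int]], threshold: float = 0.80) -> float:
    """Fraction of filled sequences whose similarity is strictly above threshold.

    Each item in `fills` is (matching bases, filled length) for one gap fill.
    """
    sims = [similarity(m, n) for m, n in fills]
    if not sims:
        raise ValueError("no filled sequences given")
    return sum(s > threshold for s in sims) / len(sims)


# Toy data: similarities of 0.95, 0.80, 0.50 and 0.90.
toy_fills = [(95, 100), (80, 100), (50, 100), (900, 1000)]
share = fraction_above(toy_fills)
```

With a strict comparison, a fill at exactly 80% similarity does not count as "over 80%", so two of the four toy fills pass.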

The third human dataset included 10xG Linked-Reads and PacBio HiFi reads of the CHM13 cell line. By using reads from both sequencing technologies in LOCLA's gap filling and scaffolding modules, we added 46.2 million bases to the Supernova assembly. The additional content enabled us to identify genes linked to complex diseases (e.g., ARHGAP11A) and critical biological pathways.