How do you fragment a genome into non-overlapping sequences of different sizes?



I have bacterial chromosomes/plasmids; I've downloaded several FASTA files from RefSeq, and I'd like to fragment them into non-overlapping fragments of between 1 kb and 15 kb.

I've tried Googling for a relevant tool, to no avail. Does anyone know how I could accomplish this? This paper does something very similar: https://link.springer.com/article/10.1007/s40484-019-0187-4

Thanks!


You could use numpy.random.randint to sample fragment lengths from a uniform distribution between 1000 and 15000 nt, extracting subsequences of those lengths from a linearized multi-FASTA file.

Here's some code:

#!/usr/bin/env python

import sys
import numpy as np

low = int(sys.argv[1])
high = int(sys.argv[2])

def process_record(header, sequence, low, high):
    start = 0
    fragment = 1
    while start < len(sequence):
        # draw a fragment length uniformly from [low, high]
        end = np.random.randint(low, high + 1) + start
        # if the remainder after this fragment would be shorter than
        # `low`, absorb it into the current (final) fragment
        if end > len(sequence) - low:
            end = len(sequence)
        sys.stdout.write('{}:{}\n{}\n'.format(header, fragment, sequence[start:end]))
        start = end
        fragment += 1

header = ""
sequence = ""
for line in sys.stdin:
    if line.startswith('>'):
        if len(sequence) > 0:
            process_record(header, sequence, low, high)
            sequence = ""
        header = line.rstrip()
    else:
        sequence += line.rstrip()

if len(sequence) > 0:
    process_record(header, sequence, low, high)

To use it:

$ ./13879.py 1000 15000 < in.fa > out.fa

Fragments will be disjoint (non-overlapping, contiguous) subsequences.

I added a check to determine whether the last fragment of a FASTA record would be shorter than low (1 kb) in length and, if so, to append that fragment to the preceding one. Given your parameters, that merged final fragment can be up to 16 kb - 1 nt in length:

if end > len(sequence) - low:
    end = len(sequence)

You could change this logic to discard the short tail instead, or simply allow fragments less than 1 kb in size, depending on what you need; a sketch of the discard variant follows.
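For example, here is a minimal discard variant of the loop in process_record (a sketch using the same variable names, untested against your data) that keeps every emitted fragment within [low, high] and silently drops any tail shorter than low:

while len(sequence) - start >= low:
    # clamp the draw so the final fragment never runs past the sequence end
    end = min(np.random.randint(low, high + 1) + start, len(sequence))
    sys.stdout.write('{}:{}\n{}\n'.format(header, fragment, sequence[start:end]))
    start = end
    fragment += 1
# anything left over here is shorter than `low` and is discarded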

EDIT

Here's a slight modification that should allow you to preserve fragment sizes within your bounds.

A cartoon may help illustrate the problem. Let's set up a FASTA record, which has been broken up into n fragments:

   1       2           n-2      n-1      n
|-----|---------|...|-------|---------|-----|

We may get unlucky and sample a random integer for fragment n-1 (frag(n-1)) which makes the length of frag(n) less than low (1 kb in your example).

In my first suggested solution, frag(n-1) and frag(n) are appended to one another. This is why you get some fragments longer than high (15 kb).

Another option is to discard frag(n) whenever it would be shorter than low (1 kb). Here's some code:

#!/usr/bin/env python

import sys
import numpy as np
import timeit

low = int(sys.argv[1])
high = int(sys.argv[2])
in_fn = sys.argv[3]

"""
Globals
"""
fragment_lengths = []

"""
Process a record
"""
def process_record(header, sequence, low, high):
    start = 0
    fragment = 1
    fragment_sequence = ""
    # while more than `high` nt remain, a full-size draw always fits
    while len(sequence) - start > high:
        end = np.random.randint(low, high + 1) + start
        fragment_sequence += sequence[start:end]
        # Comment the sys.stdout.write and uncomment the fragment_lengths line, if measuring fragment length distributions
        sys.stdout.write('{}:{}-{}:{}\n{}\n'.format(header, start, end, fragment, sequence[start:end]))
        #fragment_lengths.append(end - start)
        start = end
        fragment += 1
    # the remaining tail is at most `high` nt long; emit it only if it is
    # at least `low` nt, so that every fragment stays within bounds
    if len(sequence) - start >= low:
        fragment_sequence += sequence[start:]
        sys.stdout.write('{}:{}-{}:{}\n{}\n'.format(header, start, len(sequence), fragment, sequence[start:]))
        #fragment_lengths.append(len(sequence) - start)
    """
    Uncomment to compare input sequence with fragments
    """
    #sys.stderr.write('In > {}\nOut > {}\n'.format(sequence, fragment_sequence))
    
"""
Parse records
"""
def parse_records(in_fn, low, high):
    header = ""
    sequence = ""
    with open(in_fn, 'r') as ifh:
        for line in ifh:
            if line.startswith('>'):
                if len(sequence) > 0:
                    process_record(header, sequence, low, high)
                    sequence = ""
                header = line.rstrip()
            else:
                sequence += line.rstrip()
        if len(sequence) > 0:
            process_record(header, sequence, low, high)

def count_elements(seq):
    hist = {}
    for i in seq:
        hist[i] = hist.get(i, 0) + 1
    return hist

def wrapper(func, *args, **kwargs):
    def wrapped():
        return func(*args, **kwargs)
    return wrapped

def main(in_fn, low, high):
    """
    Comment below if testing the distribution of fragments
    """
    parse_records(in_fn, low, high)
    """
    Uncomment to test the distribution of fragments from 
    running fns on same input 100k times
    """
    #wrapped = wrapper(parse_records, in_fn, low, high)
    #timeit.timeit(wrapped, number=100000)
    #print(count_elements(fragment_lengths))

if __name__ == "__main__":
    main(in_fn, low, high)
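
To use it (this version takes the input FASTA as a third argument, rather than reading from standard input):

$ ./13879.py 1000 15000 in.fa > out.fa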

If frag(n) is needed, and you need to preserve the uniform distribution (or whatever fragment length distribution you want to model), then rejection sampling might be an option.

With rejection sampling, you keep re-fragmenting the input sequence until one draw of lengths yields a set of fragments, all within your bounds, that entirely covers the input sequence.
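
Here is a rough sketch of the idea, as a standalone helper I'm calling sample_lengths (a hypothetical name, not part of the script above):

import numpy as np

def sample_lengths(seq_len, low, high, max_tries=100000):
    # Redraw uniform fragment lengths until they tile the sequence exactly,
    # with every fragment (including the last) within [low, high].
    for _ in range(max_tries):
        lengths, remaining = [], seq_len
        while remaining > high:
            draw = np.random.randint(low, high + 1)
            lengths.append(draw)
            remaining -= draw
        if remaining >= low:  # the tail is within bounds: accept this draw
            lengths.append(remaining)
            return lengths
        # tail shorter than `low`: reject the whole draw and start over
    raise RuntimeError('no acceptable fragmentation found')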

If you need that, follow up and I'll take another look.



I've previously written my own script to fragment DNA sequences (input as either FASTQ or FASTA files) into identical-length fragments, with a given overlap. I used it because a particular assembler demanded equal-length sequences and I was working with nanopore reads, but it's since been useful for other purposes as well.

Because it is most concerned about making lengths identical, there may be some forced overlap when setting the overlap to zero. The overlap will only be exactly zero when the sequence length is an exact multiple of the fragment size. Here's an example use:

zero overlap

normalise_seqlengths.pl -fraglength 25 -overlap 0 oneRead.fq
@A00187:180:HLGNGDSXX:3:1101:8486:1031#0 [150 bp, 6 fragments, 0 overlap] 1:N:0:CGCAAGAA+ACTCCTAA
ATTGCGCAATGACAGCCAGGGGAGT
+
FFFFFFFFFFFFFFFFF:,FFFFFF
@A00187:180:HLGNGDSXX:3:1101:8486:1031#1:25-50  1:N:0:CGCAAGAA+ACTCCTAA
GGGAGCTGTCTGGGGTAACCTGGCT
+
:FFFF:F:FFFFF::FFFFF,::FF
@A00187:180:HLGNGDSXX:3:1101:8486:1031#2:50-75  1:N:0:CGCAAGAA+ACTCCTAA
CTAGGACCAGGTGGAAAAAGCACCT
+
F:FFFFFFFFFFFFFFFFFFFFFFF
@A00187:180:HLGNGDSXX:3:1101:8486:1031#3:75-100  1:N:0:CGCAAGAA+ACTCCTAA
TGGCTGGCCCAGCCCTGTTCTTCTC
+
:FF,FFFFF,:,FFF::F:FFF,:F
@A00187:180:HLGNGDSXX:3:1101:8486:1031#4:100-125  1:N:0:CGCAAGAA+ACTCCTAA
CTGGGGAGGGAAGGGCAGGTCCCCT
+
F:FFFFFFFFF,FFFFF:FFF,F,,
@A00187:180:HLGNGDSXX:3:1101:8486:1031#5:125-150  1:N:0:CGCAAGAA+ACTCCTAA
AATGCCTGAACCACATGGTGGGCCA
+
FFF,F:FFFFFFF:F,,,FFF,F,,

partial overlap

normalise_seqlengths.pl -fraglength 63 -overlap 0 oneRead.fq
@A00187:180:HLGNGDSXX:3:1101:8486:1031#0 [150 bp, 3 fragments, 20 overlap] 1:N:0:CGCAAGAA+ACTCCTAA
ATTGCGCAATGACAGCCAGGGGAGTGGGAGCTGTCTGGGGTAACCTGGCTCTAGGACCAGGTG
+
FFFFFFFFFFFFFFFFF:,FFFFFF:FFFF:F:FFFFF::FFFFF,::FFF:FFFFFFFFFFF
@A00187:180:HLGNGDSXX:3:1101:8486:1031#1:43-106  1:N:0:CGCAAGAA+ACTCCTAA
CCTGGCTCTAGGACCAGGTGGAAAAAGCACCTTGGCTGGCCCAGCCCTGTTCTTCTCCTGGGG
+
FF,::FFF:FFFFFFFFFFFFFFFFFFFFFFF:FF,FFFFF,:,FFF::F:FFF,:FF:FFFF
@A00187:180:HLGNGDSXX:3:1101:8486:1031#2:87-150  1:N:0:CGCAAGAA+ACTCCTAA
CCCTGTTCTTCTCCTGGGGAGGGAAGGGCAGGTCCCCTAATGCCTGAACCACATGGTGGGCCA
+
FFF::F:FFF,:FF:FFFFFFFFF,FFFFF:FFF,F,,FFF,F:FFFFFFF:F,,,FFF,F,,
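
For reference, here is a minimal Python sketch of the fragment placement these examples illustrate; it is my reading of the output above rather than the actual normalise_seqlengths.pl implementation, and it ignores the -overlap option (both examples use -overlap 0):

def fragment_starts(seq_len, frag_len):
    # minimum number of equal-length fragments needed to cover the sequence
    n = -(-seq_len // frag_len)  # ceiling division
    if n == 1:
        return [0]
    # space the starts as evenly as possible so the last fragment ends at
    # seq_len; any remainder shows up as forced overlap between neighbours
    return [i * (seq_len - frag_len) // (n - 1) for i in range(n)]

For the 150 bp read above, fragment_starts(150, 25) gives starts 0, 25, 50, 75, 100, 125 (zero overlap), and fragment_starts(150, 63) gives 0, 43, 87, reproducing the fragments at 0-63, 43-106, and 87-150 with the forced overlap of about 20 nt.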