# 如何將基因組片段化成不同大小的非重疊序列？

You could use `numpy.random.randint` to sample integers from a uniform distribution, from 1000 to 15000 nt, extracting subsequences of that length from a linearized multi-FASTA file.

Here's some code:

``````#!/usr/bin/env python

import sys
import numpy as np

low = int(sys.argv[1])
high = int(sys.argv[2])

next = 0
fragment = 1
while next < len(sequence):
new_next = np.random.randint(low, high) + next
if new_next > len(sequence) - high:
new_next = len(sequence)
next = new_next
fragment += 1

sequence = ""
for line in sys.stdin:
if line.startswith('>'):
if len(sequence) > 0:
sequence = ""
else:
sequence += line.rstrip()

if len(sequence) > 0:
``````

To use it:

``````\$ ./13879.py 1000 15000 < in.fa > out.fa
``````

Fragments will be disjoint (non-overlapping, contiguous) subsequences.

I add a check to determine if the last fragment of a FASTA record would be shorter than `low` (1kb) in length, and, if so, append that fragment to the final subsequence. It is possible for that last fragment to be up to 16 kb - 1 nt in length, given your parameters:

``````if new_next > len(sequence) - high:
new_next = len(sequence)
``````

You could change this logic to discard the last fragment (i.e. `break`), or simply allow fragments less than 1kb in size, depending on what you need.

EDIT

Here's a slight modification that should allow to you preserve fragment sizes within your bounds.

A cartoon may help illustrate the problem. Let's set up a FASTA record, which has been broken up into `n` fragments:

``````   1       2           n-2      n-1      n
|-----|---------|...|-------|---------|-----|
``````

We may get unlucky and sample a random integer for fragment `n-1` (`frag(n-1)`) which makes the length of `frag(n)` less than `low` (1 kb in your example).

In my first suggested solution, `frag(n-1)` and `frag(n)` are appended to one another. This is why you get some fragments longer than `high` (15 kb).

Another option is to discard `frag(n)`, which is of length less than `low` (1 kb). Here's some code:

``````#!/usr/bin/env python

import sys
import numpy as np
import timeit

low = int(sys.argv[1])
high = int(sys.argv[2])
in_fn = sys.argv[3]

"""
Globals
"""
fragment_lengths = []

"""
Process a record
"""
next = 0
last = 0
fragment = 1
fragment_sequence = ""
while last < len(sequence) - high:
next = np.random.randint(low, high + 1) + last
fragment_sequence += sequence[last:next]
# Comment the sys.stdout.write and uncomment the fragment_lengths line, if measuring fragment length distributions
#fragment_lengths.append(next - last)
last = next
fragment += 1
"""
Uncomment to compare input sequence with fragments
"""
#sys.stderr.write('In > {}\nOut > {}\n'.format(sequence, fragment_sequence))

"""
Parse records
"""
def parse_records(in_fn, low, high):
sequence = ""
with open(in_fn, 'r') as ifh:
for line in ifh:
if line.startswith('>'):
if len(sequence) > 0:
sequence = ""
else:
sequence += line.rstrip()
if len(sequence) > 0:

def count_elements(seq):
hist = {}
for i in seq:
hist[i] = hist.get(i, 0) + 1
return hist

def wrapper(func, *args, **kwargs):
def wrapped():
return func(*args, **kwargs)
return wrapped

def main(in_fn, low, high):
"""
Comment below if testing the distribution of fragments
"""
parse_records(in_fn, low, high)
"""
Uncomment to test the distribution of fragments from
running fns on same input 100k times
"""
#wrapped = wrapper(parse_records, in_fn, low, high)
#timeit.timeit(wrapped, number=100000)
#print(count_elements(fragment_lengths))

if __name__ == "__main__":
main(in_fn, low, high)
``````

If `frag(n)` is needed, and you need to preserve the uniform distribution (or whatever fragment length distribution you want to model), then rejection sampling might be an option.

With rejection sampling, you keep fragmenting your input sequence, until one of the distributions of lengths gets you a set of fragments that entirely cover the input sequence.

If you need that, follow up and I'll take another look.

I've previously written my own script to fragment DNA sequences (input as either fastq or fasta files) into identical-length fragments, with a given overlap. I used this because there was a particular assembler that demanded equal-length sequences, and I was working with nanopore reads, but it's since been useful for other purposes as well.

Because it is most concerned about making lengths identical, there may be some forced overlap when setting the overlap to zero. The overlap will only be exactly zero when the sequence length is an exact multiple of the fragment size. Here's an example use:

## zero overlap

``````normalise_seqlengths.pl -fraglength 25 -overlap 0 oneRead.fq
@A00187:180:HLGNGDSXX:3:1101:8486:1031#0 [150 bp, 6 fragments, 0 overlap] 1:N:0:CGCAAGAA+ACTCCTAA
ATTGCGCAATGACAGCCAGGGGAGT
+
FFFFFFFFFFFFFFFFF:,FFFFFF
@A00187:180:HLGNGDSXX:3:1101:8486:1031#1:25-50  1:N:0:CGCAAGAA+ACTCCTAA
GGGAGCTGTCTGGGGTAACCTGGCT
+
:FFFF:F:FFFFF::FFFFF,::FF
@A00187:180:HLGNGDSXX:3:1101:8486:1031#2:50-75  1:N:0:CGCAAGAA+ACTCCTAA
CTAGGACCAGGTGGAAAAAGCACCT
+
F:FFFFFFFFFFFFFFFFFFFFFFF
@A00187:180:HLGNGDSXX:3:1101:8486:1031#3:75-100  1:N:0:CGCAAGAA+ACTCCTAA
TGGCTGGCCCAGCCCTGTTCTTCTC
+
:FF,FFFFF,:,FFF::F:FFF,:F
@A00187:180:HLGNGDSXX:3:1101:8486:1031#4:100-125  1:N:0:CGCAAGAA+ACTCCTAA
CTGGGGAGGGAAGGGCAGGTCCCCT
+
F:FFFFFFFFF,FFFFF:FFF,F,,
@A00187:180:HLGNGDSXX:3:1101:8486:1031#5:125-150  1:N:0:CGCAAGAA+ACTCCTAA
AATGCCTGAACCACATGGTGGGCCA
+
FFF,F:FFFFFFF:F,,,FFF,F,,
``````

## partial overlap

``````normalise_seqlengths.pl -fraglength 63 -overlap 0 oneRead.fq
@A00187:180:HLGNGDSXX:3:1101:8486:1031#0 [150 bp, 3 fragments, 20 overlap] 1:N:0:CGCAAGAA+ACTCCTAA
ATTGCGCAATGACAGCCAGGGGAGTGGGAGCTGTCTGGGGTAACCTGGCTCTAGGACCAGGTG
+
FFFFFFFFFFFFFFFFF:,FFFFFF:FFFF:F:FFFFF::FFFFF,::FFF:FFFFFFFFFFF
@A00187:180:HLGNGDSXX:3:1101:8486:1031#1:43-106  1:N:0:CGCAAGAA+ACTCCTAA
CCTGGCTCTAGGACCAGGTGGAAAAAGCACCTTGGCTGGCCCAGCCCTGTTCTTCTCCTGGGG
+
FF,::FFF:FFFFFFFFFFFFFFFFFFFFFFF:FF,FFFFF,:,FFF::F:FFF,:FF:FFFF
@A00187:180:HLGNGDSXX:3:1101:8486:1031#2:87-150  1:N:0:CGCAAGAA+ACTCCTAA
CCCTGTTCTTCTCCTGGGGAGGGAAGGGCAGGTCCCCTAATGCCTGAACCACATGGTGGGCCA
+
FFF::F:FFF,:FF:FFFFFFFFF,FFFFF:FFF,F,,FFF,F:FFFFFFF:F,,,FFF,F,,
``````