What is the best way to fetch large amounts of RNA sequence data from SRA in Python without being denied access?



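I have the following code, which uses multithreading in Python to download data from SRA. After running it a few times (for testing purposes), I now keep getting denied access to the data, and I am not sure how to fix this. In particular, cat-ing an output file gives: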

<?xml version="1.0" encoding="UTF-8"?><Error><Code>AccessDenied</Code><Message>Access Denied</Message><RequestId>cb0f0e98-cafb-1dd7-9b7b-d8c49756ec52</RequestId><HostId>QeDGVwBXYp61J0B4_OUTn7UsEsiQEec0n18DAeR0kaE</HostId></Error>

Here is my code:

import threading
import requests
import time 

start = time.perf_counter()

class MyThread(threading.Thread):
    def __init__(self, url):
        threading.Thread.__init__(self)
        self.url = url
        self.result = None
        self.filename = url.split('/')[-1]
    def run(self):
        res = requests.get(self.url)
        with open(self.filename, 'wb') as f:
            f.write(res.content)
urls = [
        'https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos1/sra-pub-run-1/SRR000001/SRR000001.1',
        'https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos1/sra-pub-run-1/SRR000001/SRR000001.2',
        'https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos1/sra-pub-run-1/SRR000002/SRR000002.1',
        'https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos1/sra-pub-run-1/SRR000002/SRR000002.2']

threads = [MyThread(url, ) for url in urls]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

finish = time.perf_counter()

print(f'Finished in {round(finish-start, 2)} second(s)')

I once needed to download more than 1000 files from SRA, and I found this blog post very useful: https://reneshbedre.github.io/blog/fqutil.html. The idea is to use Aspera to fetch the SRA files, which is what makes the download fast. Then use fasterq-dump to convert each SRA file to FASTQ. fasterq-dump is faster than fastq-dump partly because it does not compress its output; I think compression is the rate-limiting step here. After that, you can compress the FASTQ files with pigz, a multithreaded version of gzip. Note that if you are downloading to a computing cluster, you can submit the fasterq-dump and pigz steps as jobs, which makes things much faster. The fetching step cannot be submitted as a job, because it needs an internet connection (which compute nodes often lack).
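The three steps above (fetch, convert, compress) could be scripted roughly as follows. This is only a sketch under some assumptions, not the method from the blog post: it uses `prefetch` from sra-tools for the fetch step instead of Aspera, it assumes `prefetch`, `fasterq-dump` and `pigz` are on your PATH, and it assumes `prefetch` writes to `<outdir>/<accession>/<accession>.sra`, which may differ between sra-tools versions. The function names, the `sra_data` directory and the default thread count are made up for illustration.

```python
import subprocess
from pathlib import Path


def fetch_command(accession, outdir):
    # Step 1: download the .sra archive (prefetch stands in for Aspera here).
    return ["prefetch", str(accession), "--output-directory", str(outdir)]


def convert_command(sra_path, fastq_dir, threads):
    # Step 2: convert the archive to uncompressed FASTQ, one file per read.
    return ["fasterq-dump", str(sra_path), "--outdir", str(fastq_dir),
            "--threads", str(threads), "--split-files"]


def compress_command(fastq_files, threads):
    # Step 3: gzip the FASTQ files in parallel with pigz.
    return ["pigz", "-p", str(threads)] + [str(f) for f in fastq_files]


def run_pipeline(accession, outdir="sra_data", threads=4):
    """Run fetch -> convert -> compress for one run accession."""
    out = Path(outdir)
    fastq_dir = out / "fastq"
    fastq_dir.mkdir(parents=True, exist_ok=True)

    subprocess.run(fetch_command(accession, out), check=True)

    # Assumed layout of the prefetch output; adjust if your version differs.
    sra_path = out / accession / f"{accession}.sra"
    subprocess.run(convert_command(sra_path, fastq_dir, threads), check=True)

    # The number of FASTQ files depends on the run, so look them up afterwards.
    fastq_files = sorted(fastq_dir.glob(f"{accession}*.fastq"))
    if fastq_files:
        subprocess.run(compress_command(fastq_files, threads), check=True)
```

Calling `run_pipeline("SRR000001")` would process a single run. On a cluster you would probably run only the fetch step on a node with internet access and submit the commands built by `convert_command` and `compress_command` as separate jobs.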