How to get repeat regions from a FASTA file or directly from the Internet


0

If I go to the NCBI site, apply Customize view -> Customize -> All features -> Update view, and then search the page with Ctrl+F for "Alu", I can see the coordinates of all Alu repeat regions on that chromosome.

It looks like this:

 repeat_region   complement(37490..37800)
                 /note="AluY"
                 /rpt_family="SINE/Alu"
                 /rpt_type=DISPERSED

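How can I download all of these coordinates and the repeat-region names from the Internet into a .txt file? Or is there a tool that does this?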

1

Click on the 'GenBank' dropdown menu at the top left corner of this page to see a graphical view of the sequence. You should see a 'Repeat region' track in the Sequence Viewer. Click on the 'Download' button in the top right corner of the Sequence Viewer, choose 'Download Track Data' to launch a dialog box where you can enter a range, select the 'Repeat region' track and download data in BED format. Unfortunately, the names for the features are not great.

If you want the names to match the ones in the flat file, you can download the feature table and parse it. To download the feature table, click on the 'Send To' link at the top right corner of this page, then choose 'Complete Record', 'File', and 'Feature Table' as the 'Format'. After you download the feature table, you can parse it using awk as shown below to create a 4-column table (start, end, feature type, qualifiers).

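awk 'BEGIN{FS="\t";OFS="\t"}
  # Print the feature collected so far; skip the empty record before the first feature.
  function emit() { if (s != "") print s,e,f,q }
  NF==3 {emit(); s=$1; e=$2; f=$3; q=""}   # new feature: start, end, feature type
  NF==2 {emit(); s=$1; e=$2; q=""}         # additional interval of the current feature
  NF==5 {q=q"|"$4"="$5}                    # qualifier line: key in $4, value in $5
  END   {emit()}                           # the last feature has no following line to trigger it
' seq.ft > feat_table.tsv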
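
To get from that table to the .txt file the question asks for, the rows can be filtered down to Alu repeats. The following is only a sketch: it assumes feat_table.tsv has the four columns produced above, that the repeat name is stored in the note qualifier, and that a start larger than the end marks a complement-strand feature, as in the feature table format. The sample rows and the output name alu_repeats.txt are made up for illustration, apart from the AluY record quoted in the question.

```shell
#!/bin/sh
# Hypothetical sample of feat_table.tsv (start, end, feature type, qualifiers).
# Only the AluY row comes from the question; the other rows are invented.
printf '37800\t37490\trepeat_region\t|note=AluY|rpt_family=SINE/Alu|rpt_type=DISPERSED\n' >  feat_table.tsv
printf '40100\t40400\trepeat_region\t|note=L1MB7|rpt_family=LINE/L1\n'                   >> feat_table.tsv
printf '52010\t52300\trepeat_region\t|note=AluSx|rpt_family=SINE/Alu\n'                  >> feat_table.tsv

# Keep Alu repeat regions and write: start, end, strand, name.
awk 'BEGIN { FS = "\t"; OFS = "\t" }
$3 == "repeat_region" && $4 ~ /rpt_family=SINE\/Alu/ {
    name = "."                               # placeholder if no note qualifier is present
    n = split($4, kv, "[|]")                 # qualifiers are joined with "|"
    for (i = 1; i <= n; i++)
        if (kv[i] ~ /^note=/) name = substr(kv[i], 6)
    # start > end is taken to mean the complement strand
    if ($1 + 0 > $2 + 0) print $2, $1, "-", name
    else                 print $1, $2, "+", name
}' feat_table.tsv > alu_repeats.txt

cat alu_repeats.txt
```

On a real download, drop the three printf lines and run only the awk command against your own feat_table.tsv. Coordinates carrying partial markers such as < or > are not handled by this sketch.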