比較對齊的氨基酸和密碼子


1
我有一組4個氨基酸,它們已經比對,但是我想將它們分別與改變密碼子水平的核苷酸序列進行比較。

我有一個比對文件,我只想在C末端結構域區域中使用,只有CTD以氨基酸“NITNLC”開頭,直到以“HAPATV”結束,其餘序列我都不想採用。

簡而言之,我想採用從" NITNLC"到" HAPATV"開頭的部分序列,然後我想比較密碼子水平核苷酸序列的變化。

  1. 首先,我使用seqinr庫從比對的蛋白質序列生成核苷酸比對。

    reverse.align {seqinr}

  2. 然後下一步,我想將我的每個氨基酸序列與我的參考序列(在我的情況下為P0DTC2)比較到其餘序列,例如,我想將P0DTC2K9N5Q8與上述CTD區域進行比較從NITNLC開始到HAPATV結束,如果我想報告氨基酸水平和密碼子水平,請找出密碼子水平的變化是什麼。

第一部分我想我的方法是正確的。第二部分我不確定如何進行,我想這遠不止是簡單解析。

任何幫助或建議將不勝感激,並且如果存在基於R的解決方案將非常受歡迎。

文件Amino acid file nucleotide file Aligned amino acid file Reverse alignment nucleotide file using seqinr

2

To set the ball rolling, I tried to implement it manually with your files. Overall, there are three steps:

  1. Locate NITNLC (or HAPATV) in protein P0DTC2.
  2. Locate mismatched amino acids between proteins P0DTC2 and K9N5Q8 within the range from step 1.
  3. Print the amino acid and the DNA codon of both proteins from step 2.

It works, but only for the first 60 amino acids. I wonder if it's due to the amino sequences in AMINOOO_seq_removed.fasta repeat itself every 60 acids. But why?

#For example, the first three lines of protein P0DTC2
>P0DTC2
MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFS MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFS
NVTWFHAIHVSGTNGTKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIV NVTWFHAIHVSGTNGTKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIV
NNATNVVIKVCEFQFCNDPFLGVYYHKNNKSWMESEFRVYSSANNCTFEYVSQPFLMDLE NNATNVVIKVCEFQFCNDPFLGVYYHKNNKSWMESEFRVYSSANNCTFEYVSQPFLMDLE
...

Step 0: Read the files.

library('seqinr')

#align file containing protein sequences
count_added <- read.alignment('count_added_.clustal_num', format='clustal')
names(count_added$seq) <- count_added$nam

#DNA sequences
rev3.aln <- read.alignment('rev3.aln', format='fasta')
names(rev3.aln$seq) <- rev3.aln$nam

Step 1: Locate NITNLC (or HAPATV) in protein P0DTC2. Again, there are two repeats of NITNLC 60 acids apart (at 807 and 870).

locate <- function(seq, find)
  {address <- gregexpr(paste(strsplit(find, '')[[1]], collapse='[^a-z]*'), seq)
  #substr(seq, address[[1]][1], address[[1]][1]+attr(address[[1]], 'match.length')[1]-1)

  return(list(start=as.numeric(address[[1]]), 
             end=as.numeric(address[[1]] + attr(address[[1]], 'match.length') - 1)))
  }

locate(seq=count_added$seq[['P0DTC2']], find='nitnlc')
#start
#807 870
#end
#812 875
locate(seq=count_added$seq[['P0DTC2']], find='hapatv')
#start
#1212 1279
#end
#1217 1285

Step 2: Locate mismatched amino acids between proteins P0DTC2 and K9N5Q8 within the range from step 1. Instead of using the range 812 - 1279, I chose range 1 - 20 for demonstration purposes.

compare <- function(seq1, seq2, after=0, before=100000)
  {seq1_ = strsplit(seq1, '')[[1]]
  seq2_ = strsplit(seq2, '')[[1]]
  ind = which(seq1_ != seq2_ & grepl('[a-z]',seq1_) & grepl('[a-z]',seq2_))
  ind = ind[ind>after & ind<before]
  #seq1_[ind[1]]
  #seq2_[ind[1]]

  return(ind)
  }
compare(seq1=count_added$seq[['P0DTC2']], seq2=count_added$seq[['K9N5Q8']], after=1, before=20)
# [1] 5  7  8  9 10 13 14

#protein comparison
#K9N5Q8      MIHSVFLLMFLLTPTESYVD
#P0DTC2      ----MFVFLVLLPL------
#<mismatch>      5 7890  34

Step 3: Print the amino acid and the DNA codon of both proteins from step 2. Note that ind is based on the align file.

print_amino_codon <- function(ind, seq, seq_gene)
  {locate_amino <- gregexpr('[a-z]', seq)[[1]]
  if (!ind %in% locate_amino) return(NA)
  ind2 = match(ind, locate_amino)

  return(c(amino=substr(seq, ind, ind), codon=substr(seq_gene, ind2*3-2, ind2*3)))
  }
 
codon(ind=5, seq=count_added$seq[['K9N5Q8']], seq_gene=rev3.aln$seq[['K9N5Q8']])
#amino codon 
#  "v" "atg" 
codon(ind=6, vseq=count_added$seq[['K9N5Q8']], seq_gene=rev3.aln$seq[['K9N5Q8']])
#amino codon 
#  "f" "ttc" 
codon(ind=7, seq=count_added$seq[['K9N5Q8']], seq_gene=rev3.aln$seq[['K9N5Q8']])
#amino codon 
#  "l" "ttg"

codon(ind=5, seq=count_added$seq[['P0DTC2']], seq_gene=rev3.aln$seq[['P0DTC2']])
#amino codon 
#  "m" "atg" 
codon(ind=6, seq=count_added$seq[['P0DTC2']], seq_gene=rev3.aln$seq[['P0DTC2']])
#amino codon 
#  "f" "ttt" 
codon(ind=7, seq=count_added$seq[['P0DTC2']], seq_gene=rev3.aln$seq[['P0DTC2']])
#amino codon 
#  "v" "gtt" 

#protein comparison
#K9N5Q8      MIHSVFLLMFLLTPTESYVD
#P0DTC2      ----MFVFLVLLPL------
#<mismatch>      5 7890  34
#<print>         ^^^

#K9N5Q8 gene codon
#gtg ttt cta ctg atg ttc ttg tta aca
#                ^^^ ^^^ ^^^
#P0DTC2 gene codon
#atg ttt gtt ttt ctt
#^^^ ^^^ ^^^