Perl count genes

目录 Perl

经常遇到要统计fasta格式基因组的各项指标,如以下test.fa:

>orf00001
ATGGTACAATATAACAATAATTATCCACAAGACAATAAGGAAGAAGCTATGACGGAAAACGAACAACTATTTTGGAATAGAGTACTAGAGCTATCTCGTTCTCAAATAGCACCAGCAGCTTATGAATTTTTTGTTCTAGAGGCTAGACTCCTCAAAATTGAACATCAAACTGCAGTTATTACTTTAGATAACATTGAAATGAAAAAGCTATTCTGGGAACAAAATTTGGGGCCTGTTATCCTAACAGCTGGTTTTGAAATTTTCAATGCTGAAATTACAGCTAACTATGTCTCAAACGATTTACATTTACAAGAAACTAGTTTTTCTAACTACCAGCAATCTAGCAATGAAGTAAATACTTTACCAATTAGAAAAATCGACTCTAATCTTAAAGAGAAATATACTTTTGCTAATTTTGTTCAAGGAGATGAAAATAGATGGGCTGTTTCAGCATCAATTGCTGTAGCTGATAGTCCTGGCACGACTTATAATCCTCTATTTATCTGGGGAGGACCTGGTCTAGGAAAGACGCATCTACTAAATGCTATTGGAAATCAAGTCTTAAGAGATAATCCAAACGCGAGGGTTTTATACATCACTGCTGAGAATTTTATTAATGAATTCGTCAGTCATATTCGTTTAGATTCGATGGAAGAATTAAAAGAAAAGTTTCGCAACTTGGACTTACTCCTGATTGATGATATTCAGTCGCTTGCTAAGAAAACCTTAGGGGGGACCCAAGAGGAGTTCTTCAATACTTTCAATGCTTTACATACAAACGATAAACAAATCGTATTGACCAGTGACCGAAATCCAAATCAATTAAATGATCTAGAAGAACGTCTAGTCACGCGCTTTAGTTGGGGACTCCCAGTAAATATCACACCACCTGATTTTGAAACACGAGTTGCTATTTTAACCAATAAAATTCAAGAATATCCTTATGATTTTCCTCAAGATACCATTGAATACTTAGCAGGAGAATTTGATTCCAACGTACGTGAATTAGAAGGAGCCTTGAAAAATATTAGTCTAGTTGCTGACTTTAAGCATGCTAAAACTATTACAGTAGATATAGCTGCAGAAGCTATCAGAGCACGTAAAAATGACGGTCCTATTGTTACTGTCATTCCTATAGAAGAAATTCAAATACAGGTTGGTAAATTCTATGGCGTAACTGTAAAAGAGATAAAAGCAACTAAAAGAACACAAGATATTGTCCTTGCAAGACAGGTAGCCATGTACTTAGCTCGTGAGATGACAGATAACAGTCTCCCCAAAATAGGTAAAGAATTTGGGGGACGAGATCACTCAACTGTTCTCCACGCTTATAATAAAATAAAAAATATGGTTGCTCAAGATGACAACTTACGAATTGAGATAGAAACTATCAAAAATAAAATCAGGTAG
>orf00002
ATGATTCATTTTTCAATAAATAAAAATTTCTTCTTGCATGCTCTAACGGTAACCAAACGAGCTATTAGTCATAAAAATGCGATTCCAATCCTTTCAACTGTTAAAATAGAAGTGACTAGAGATGCTATCATTTTAACGGGGTCAAATGGACAAATTTCAATTGAAAATACTATTCCTGCTTCAAATGAAAACGCAGGTTTACTAGTAACGAATCCAGGCTCTATTTTGTTAGAAGCTGGTTTCTTTATTAATATTATTTCAAGTTTACCAGATGTAACTTTAGAATTTACAGAGATTGAACAACATCAAATTGTTCTTACTAGTGGAAAATCAGAGATTACTTTGAAAGGTAAGGATGTCGATCAATACCCTCGTCTGCAGGAAATGACAACAGATACTCCATTAACATTAGAAACTAAACTGTTAAAATCAATTATTAATGAAACTGCTTTTGCTGCTAGCCAACAAGAAAGCCGTCCAATCTTAACAGGTGTTCATTTGGTTATCAGTCAAAATAAATACTTTAAGGCTGTTGCGACAGATTCACACCGTATGAGTCAACGCACTTTCCAATTAGAGAAATCGGCTAATAATTTTGATTTGGTTGTTCCAAGTAAATCCCTTCGAGAATTTTCGGCTGTTTTTACAGATGATATTGAAACTGTAGAGGTTTTCTTCTCAGATAGTCAAATGTTATTTAGAAGTGAAAATATCAGCTTCTATACACGTTTGCTTGAAGGAAACTACCCTGATACTGATCGCCTCCTAACTAATCAGTTTGAAACCGAAATTATCTTTAATACAAATGCTTTACGCCATGCTATGGAACGTGCTTATTTAATTTCGAATGCAACTCAGAACGGTACTGTTCGTTTAGAAATTCAAAATGAAACAGTCTCAGCTCATGTAAACTCTCCAGAAGTTGGTAAAGTTAATGAGGAATTGGATACTGTTAGCCTTAAAGGTGATAGTTTAAATATTAGTTTTAATCCAACTTACCTAATTGAATCTTTAAAAGCAGTAAAAAGCGAAACAGTTACGATTCGATTTATTTCTCCAGTACGTCCATTTACTTTGACACCTGGTGAAGATACTGAAGATTTCATACAATTAATAACTCCTGTTCGTACTAACTAA
>orf00003
ATGATGAGTAATATGACTCTATATATAATAGCTAACCCCCATGCTGGTAATAAAAATGCCTCCACTATTGTTGGTCAAATTCAGGAGTTTTATCATACTGAAGATATTTCTGTGTTCTATACAGAACAGAAAGATGATGAAAAAAAACAAGTCATTAATATACTAAGGTCTTTTAAAGAAAGTGATCATCTAATGATTATAGGAGGAGATGGTACCTTATCAAAAGTAATGACTTATCTCCCCAACATATTCCGTGCGCTTATTATCCTGTTGGTTCGGGAAATGATTTTGCCAGAGCTTTGA

可用perl脚本count.pl来实现:

#!/usr/bin/perl
use strict;
use warnings;

#Usage:perl count.pl $ARGV[0] $ARGV[1]

my $total_len=0;
my $total_num=0;
my $total_gap=0;
my $total_gc=0;
my @len_array=();

open IN,">$ARGV[0]",or die "can't open the file $!\n"; #genome file
open OUT,">$ARGV[1]"; #result file

print OUT "ID\tGeneLength(bp)\tGapLength(bp)\tGCContent(%)\n";

$/=">";
<IN>;
while(<IN>){
next if (/^\s+$/);
chomp;
my ($id,$seq)=(split /\n/,$_,2)[0,1];
$seq=~ s/\n//g;
my $len=length ($seq);
push @len_array,$len;
$total_len+=$len;
my $gc+=($seq =~ s/G/G/g+$seq=~ s/C/C/g);
my $GC=($gc/$len)*100;
my $gap+=($seq=~ s/N/N/g);
$total_gap+=$gap;
$total_gc+=$gc;
print OUT "$id\t$len\t$gap\t";
printf OUT "%.2f\n","$GC";
$total_num++;
}

my $avg_len=($total_len/$total_num);
my $total_GC=($total_gc/$total_len)*100;
my @sort_len=sort {$a <=> $b} @len_array;

print OUT "\nTotal Stat:\n";
print OUT "Total Scaffolds Number (#):$total_num\n";
print OUT "Total Length (bp):$total_len\n";
print OUT "Gap Length (N)(bp):$total_gap\n";
printf OUT "Average Length (bp):%d\n","$avg_len";
print OUT "Minimum Length (bp):$sort_len[0]\n";
print OUT "Maximum Length (bp):$sort_len[-1]\n";
printf OUT "GC content(%%):%.2f\n","$total_GC";

close IN;
close OUT;

命令终端运行以下命令:

perl count.pl test.fa test.result.txt

便可得到以下结果test.result.txt:

ID GeneLength(bp) GapLength(bp) GCContent(%)
orf00001 1410 0 35.18
orf00002 1137 0 33.42
orf00003 303 0 33.66

Total Stat:
Total Scaffolds Number (#):3
Total Length (bp):2850
Gap Length (N)(bp):0
Average Length (bp):950
Minimum Length (bp):303
Maximum Length (bp):1410
GC content(%):34.32