Hash sort

目录 Perl

perl语言中hash排序可以按照key和value排序,各自又可分为按照数字大小和ASCII排序。如果$a和$b颠倒,则为倒序排列。

key排序
数字排序
foreach $key (sort {$a <=> $b} keys %hash)
ASCII排序
foreach $key (sort {$a cmp $b} keys %hash)
value排序
数字排序
foreach $key ( sort {$hash{$a} <=> $hash{$b}} keys %hash )
ASCII排序
foreach $key ( sort {$hash{$a} cmp $hash{$b}} keys %hash )

array

目录 Perl

转自脚本之家
1、去除一个数组中的重复元素:
使用grep函数代码片段:

my @array = ( 'a', 'b', 'c', 'a', 'd', 1, 2, 5, 1, 5 );
my %count;
my @uniq_times = grep { ++$count{ $_ } < 2; } @array;

使用转换hash代码片段:

my @array = ( 'a', 'b', 'c', 'a', 'd', 1, 2, 5, 1, 5 );
my %saw;
@saw{ @array } = ( );
my @uniq_array = sort keys %saw;

2、合并两个array:

push @array1, @array2;

3、快速查找最大值,不知道的程序猿们,这样搞:

my @nums = 0 .. 1000;
my $max = $nums[0];
foreach (@nums) {
    $max = $_ if $_ > $max;
}

知道的这样搞:

use List::Util qw(max);
my $max_num = max( 0 .. 1000 );

知道的他们还这样搞:

use List::Util qw(maxstr);
my $max_str = maxstr ( qw( Fido Spot Rover ) );

字符串比较玩弄于掌中。还有sum:

use List::Util qw(sum);
my $sum = sum ( 1 .. 1000 );

4、列表归并
数字求和,也可以用List::Util中的reduce:

use List::Util qw(reduce);
my $sum = reduce { $a + $b } 1 .. 1000;

与sort类似,reduce也是用code block作为参数,不过运行机制稍微不同。每次迭代,先从参数列表取出前面两个元素,分别设置为别名$a和$b,这样参数列表的长度就会缩短为两个元素。然后reduce把语句块返回的计算结果再压回到参数列表的头部。如此往复,直到最后列表里只剩下一个元素,也就是迭代的计算结果$sum。
好了,可以这样了:

my $product = reduce { $a * $b } 1 .. 1000;

5、判断是否有元素匹配
纯粹用Perl实现,找到列表中第一个符合某条件的元素,比找出所有符合条件的要麻烦一些。下面的例子,判断是否有大于1000的元素:

my $found_a_match = grep { $_ > 1000 } @list;

注意:如果@list有一亿个元素,而要找的就是1001?grep仍然还会循环一亿次,当然你可以向下面自己控制下:

my $found_a_match = 0;
foreach my $elem (@list) {
    $found_a_match = $elem if $elem > 1000;
    last if $found_a_match;
}

还是那句话,不简单~~~List::Util有现成的东西:

use List::Util qw(first);
my $found_a_match = fist { $_ > 1000 } @list;

在List::MoreUtils模块中,也提供很多的实用函数:

my $found_a_match = any { $_ > 1000 } @list;
my $all_greater = all { $_ > 1000 } @list;
my $none_greater = none { $_ > 1000 } @list;
my $all_greater = notall { $_ % 2 } @list;

6、一次遍历多个列表
一般我们同时遍历多个业务相关的列表时,往往用数组下标遍历:

my @a = ( ... );
my @b = ( ... );
my @c;
foreach my $i ( 0 .. $#list ) {
    my ( $a, $b ) = ( $a[$i], $b[$i] );
    push @c, $a + $b;
}

看下面这个,你的感觉是?

use List::MoreUtils qw(pairwise);
my @c = pairwise { $a + $b } @a, @b;

pairwise只适合两个列表的同步计算,三个后用each_array:

use List::MoreUtils qw(each_array);
my $ea = each_array( @a, @b, @c );
my @d;
while ( my ( $a, $b, $c ) = $ea->() ) {
    push @d, $a+$b+$c;
}

虽然还是有点烦,不过也还好了。
7、数组合并
合并多个数组的操作当然你可以自己写,但终究不如MoreUtils的mesh方便:

use List::MoreUtils qw(mesh);
my @odds = qw/ 1 3 5 7 9/;
my @evens= qw/ 2 4 6 8 0/;
my @nums = mesh @odds, @evens; # print: 1 2 3 4 ...

filter repeat element from array

目录 Perl

转自Perl Maven
在本节Perl教程中我们会看到如何使数组仅包含不同的值。
Perl 5没有过滤数组重复值的内建函数,但还是有很多解决方案。
List::MoreUtils
对于你的问题,可能最简单的方式是使用List::MoreUtils模块的uniq函数。

use List::MoreUtils qw(uniq);
my @words = qw(foo bar baz foo zorg baz);
my @unique_words = uniq @words;

完整示例:

use strict;
use warnings;
use 5.010;
use List::MoreUtils qw(uniq);
use Data::Dumper qw(Dumper);
my @words = qw(foo bar baz foo zorg baz);
my @unique_words = uniq @words;
say Dumper \@unique_words;

结果:

$VAR1 = [
  'foo',
  'bar',
  'baz',
  'zorg'
];

另外,该模块还提供了uniq的别名函数——distinct。
你得利用CPAN先安装这个模块后才能使用。
自制 uniq
如果不能安装上面的模块,或者你认为加载模块的开销太大,那么可以使用一个简单表达式也能达到同样的效果:

my @unique = do { my %seen; grep { !$seen{$_}++ } @data };

这对不熟悉的人会很奇怪,所以建议自定义uniq函数,然后在代码的其他部分使用。

use strict;
use warnings;
use 5.010;
use Data::Dumper qw(Dumper);
my @words = qw(foo bar baz foo zorg baz);
my @unique_words = uniq( @words );
say Dumper \@unique_words;
sub uniq {
  my %seen;
  return grep { !$seen{$_}++ } @_;
}

解释自制的uniq
举例之后当然不能置之不理,下面就来解释一下。先从早点的版本开始:

my @unique;
my %seen;
foreach my $value (@words) {
  if (! $seen{$value}) {
    push @unique, $value;
    $seen{$value} = 1;
  }
}

这里用普通的foreach循环一个个处理原数组中的值,处理过程中使用了辅助哈希表%seen,因为哈希表的键是唯一的。
开始的时候哈希表是空的,所以当遇到第一个”foo”的时候$seen{“foo”}并不存在,它的值是undef,也就是Perl中的false,表示之前没有见过这个值,然后把这个值添加到@uniq数组的最后。
这里把$seen{“foo”}设为1,其实任何为true的值都可以。
下一次遇到相同的字符串时,它已经作为%seen哈希表的键且对应的值是true,所以if条件判断会失败,也就不会push重复的值到数组。
简化自制的 unique 函数
首先把赋值语句$seen{$value} = 1;替换成自增操作符$seen{$value}++。这不会改变之前的方案——任何正数都被看作TRUE,但是会把设置”seen标志”的语句放在if条件判断中。区别开使用后缀自增(而不是前缀自增)是非常重要的,因为它会在布尔表达式计算完之后再自增。当我们第一次遇到某个值时(if的布尔)表达式为TRUE,之后再遇到同一个值时为FALSE。

my @unique;
my %seen;
foreach my $value (@data) {
  if (! $seen{$value}++ ) {
    push @unique, $value;
  }
}

这已经缩短了代码,但是我们还有更好的方案。
使用grep过滤重复值
Perl的grep函数是Unix的grep命令的一般化形式。
它基本上是一个过滤器。你可以在右边传入一个数组,在代码块中传入一个表达式。grep函数会一个一个的提取数组中的值到$_(Perl的默认标量)中,然后执行代码块。如果代码块求值为TRUE,对应的值会通过过滤。如果代码块求值为FALSE,当前值会被过滤掉。
于是我们得到这样的表达式:

my %seen;
my @unique = grep { !$seen{$_}++ } @words;

包裹在’do’或’sub’中
最后要做的是把上面两个语句包裹在do代码块

my @unique = do { my %seen; grep { !$seen{$_}++ } @words };

或者放在具名函数中:

sub uniq {
my %seen;
return grep { !$seen{$_}++ } @_;
}

另一个自制uniq
如果对元素的顺序没有要求的话,可以使用Prakash Kailasa建议的更短的uniq实现版本(需要perl 5.14或更高):
内联代码:

my @unique = keys { map { $_ => 1 } @data };

或者在函数里:

my @unique = uniq(@data);
sub uniq { keys { map { $_ => 1 } @_ } };

分开讲解一下:
map的语法和grep相似:一个代码块以及一个数组。它会遍历数组的所有元素,执行代码块并把结果传递给左边。
在我们的例子里,数组中的每个元素处理之后都会和数字1一起被传递。不要忘记=>(也称胖逗号)就是个逗号。假设@data是(‘a’, ‘b’, ‘a’),那表达式会返回(‘a’, 1, ‘b’, 1, ‘a’, 1)。

map { $_ => 1 } @data

如果把这个表达式赋值给一个哈希表,那么原来的数据会作为键,数字1会作为值。尝试一下:

use strict;
use warnings;
use Data::Dumper;
my @data = qw(a b a);
my %h = map { $_ => 1 } @data;
print Dumper \%h;

输出:

$VAR1 = {
  'a' => 1,
  'b' => 1
};

如果把它包裹在花括号里而不是赋值的话,我们会得到一个匿名哈希表的引用。

{ map { $_ => 1 } @data }

Let’s see it in action: 看一下实际例子:

use strict;
use warnings;
use Data::Dumper;
my @data = qw(a b a);
my $hr = { map { $_ => 1 } @data };
print Dumper $hr;

除了哈希表的dump顺序可能改变以外,这和之前的输出结果一致。
最后,从perl 5.14开始我们可以对哈希表引用调用keys函数,所以可以这样写:

my @unique = keys { map { $_ => 1 } @data };

这样可以从@data中提取唯一值。

Perl Translation protein

目录 Perl

将CDS翻译成蛋白的脚本cds2aa.pl

#!/usr/bin/perl
use strict;

if(scalar @ARGV==0){
    die "This program is used to trans cds to pep
    perl $0 <cds file> <pep>\n"
;
}

open IN,"<$ARGV[0]";
open OUT,">$ARGV[1]";

$/=">";
<IN>;

while (<IN>){
    chomp;
    next if (/^\s+$/);
    my ($id,$dna)=(split /\n/,$_,2)[0,1];
    $dna=~ s/\n//g;
    my $protein="";
    for (my $i=0;$i< (length($dna)-2);$i+=3){
        $protein.= &codon2aa (substr ($dna,$i,3));
    }
    print OUT ">$id\n";
    print OUT "$protein\n";
}
close IN;
$/="\n";

sub codon2aa {
    my ($codon)=@_;
    $codon=uc $codon;
    my(%genetic_code)=(
    'TCA'=>'S',#Serine
    'TCC'=>'S',#Serine
    'TCG'=>'S',#Serine
    'TCT'=>'S',#Serine
   
    'TTC'=>'F',#Phenylalanine;
    'TTT'=>'F',#Phenylalanine;
    'TTA'=>'L',#Leucine
    'TTG'=>'L',#Leucine
   
    'TAC'=>'Y',#Tyrosine
    'TAT'=>'Y',#Tyrosine
    'TAA'=>'', #Stop
    'TAG'=>'', #Stop
   
    'TGC'=>'C',#Cysteine
    'TGT'=>'C',#Cysteine
    'TGA'=>'', #Stop
    'TGG'=>'W',#Tryptophan
   
    'CTA'=>'L',#Leucine
    'CTC'=>'L',#Leucine
    'CTG'=>'L',#Leucine
    'CTT'=>'L',#Leucine
   
    'CCA'=>'P',#Proline
    'CCC'=>'P',#Proline
    'CCG'=>'P',#Proline
    'CCT'=>'P',#Proline
   
    'CAC'=>'H',#Histidine
    'CAT'=>'H',#Histidine
    'CAA'=>'Q',#Glutamine
    'CAG'=>'Q',#Glutamine
   
    'CGA'=>'R',#Arginine
    'CGC'=>'R',#Arginine
    'CGG'=>'R',#Arginine
    'CGT'=>'R',#Arginine
   
    'ATA'=>'I',#Isoleucine
    'ATC'=>'I',#Isoleucine
    'ATT'=>'I',#Isoleucine
    'ATG'=>'M',#Methionine
   
    'ACA'=>'T',#Threonine
    'ACC'=>'T',#Threonine
    'ACG'=>'T',#Threonine
    'ACT'=>'T',#Threonine
   
    'AAC'=>'N',#Asparagine
    'AAT'=>'N',#Asparagine
    'AAA'=>'K',#Lysine
    'AAG'=>'K',#Lysine
   
    'AGC'=>'S',#Serine
    'AGT'=>'S',#Serine
    'AGA'=>'R',#Arginine
    'AGG'=>'R',#Arginine
   
    'GTA'=>'V',#Valine
    'GTC'=>'V',#Valine
    'GTG'=>'V',#Valine
    'GTT'=>'V',#Valine
   
    'GCA'=>'A',#Alanine
    'GCC'=>'A',#Alanine
    'GCG'=>'A',#Alanine
    'GCT'=>'A',#Alanine
   
    'GAC'=>'D',#Aspartic Acid
    'GAT'=>'D',#Aspartic Acid
    'GAA'=>'E',#Glutamine Acid
    'GAG'=>'E',#Glutamine Acid
   
    'GGA'=>'G',#Glycine
    'GGC'=>'G',#Glycine
    'GGG'=>'G',#Glycine
    'GGT'=>'G',#Glycine
    );
    if (exists $genetic_code{$codon}) {
        return "$genetic_code{$codon}";
    }else{
        return "X";
    }
}

测试的CDS文件test.fa:

>orf00001
ATGGTACAATATAACAATAATTATCCACAAGACAATAAGGAAGAAGCTATGACGGAAAACGAACAACTATTTTGGAATAGAGTACTAGAGCTATCTCGTTCTCAAATAGCACCAGCAGCTTATGAATTTTTTGTTCTAGAGGCTAGACTCCTCAAAATTGAACATCAAACTGCAGTTATTACTTTAGATAACATTGAAATGAAAAAGCTATTCTGGGAACAAAATTTGGGGCCTGTTATCCTAACAGCTGGTTTTGAAATTTTCAATGCTGAAATTACAGCTAACTATGTCTCAAACGATTTACATTTACAAGAAACTAGTTTTTCTAACTACCAGCAATCTAGCAATGAAGTAAATACTTTACCAATTAGAAAAATCGACTCTAATCTTAAAGAGAAATATACTTTTGCTAATTTTGTTCAAGGAGATGAAAATAGATGGGCTGTTTCAGCATCAATTGCTGTAGCTGATAGTCCTGGCACGACTTATAATCCTCTATTTATCTGGGGAGGACCTGGTCTAGGAAAGACGCATCTACTAAATGCTATTGGAAATCAAGTCTTAAGAGATAATCCAAACGCGAGGGTTTTATACATCACTGCTGAGAATTTTATTAATGAATTCGTCAGTCATATTCGTTTAGATTCGATGGAAGAATTAAAAGAAAAGTTTCGCAACTTGGACTTACTCCTGATTGATGATATTCAGTCGCTTGCTAAGAAAACCTTAGGGGGGACCCAAGAGGAGTTCTTCAATACTTTCAATGCTTTACATACAAACGATAAACAAATCGTATTGACCAGTGACCGAAATCCAAATCAATTAAATGATCTAGAAGAACGTCTAGTCACGCGCTTTAGTTGGGGACTCCCAGTAAATATCACACCACCTGATTTTGAAACACGAGTTGCTATTTTAACCAATAAAATTCAAGAATATCCTTATGATTTTCCTCAAGATACCATTGAATACTTAGCAGGAGAATTTGATTCCAACGTACGTGAATTAGAAGGAGCCTTGAAAAATATTAGTCTAGTTGCTGACTTTAAGCATGCTAAAACTATTACAGTAGATATAGCTGCAGAAGCTATCAGAGCACGTAAAAATGACGGTCCTATTGTTACTGTCATTCCTATAGAAGAAATTCAAATACAGGTTGGTAAATTCTATGGCGTAACTGTAAAAGAGATAAAAGCAACTAAAAGAACACAAGATATTGTCCTTGCAAGACAGGTAGCCATGTACTTAGCTCGTGAGATGACAGATAACAGTCTCCCCAAAATAGGTAAAGAATTTGGGGGACGAGATCACTCAACTGTTCTCCACGCTTATAATAAAATAAAAAATATGGTTGCTCAAGATGACAACTTACGAATTGAGATAGAAACTATCAAAAATAAAATCAGGTAG
>orf00002
ATGATTCATTTTTCAATAAATAAAAATTTCTTCTTGCATGCTCTAACGGTAACCAAACGAGCTATTAGTCATAAAAATGCGATTCCAATCCTTTCAACTGTTAAAATAGAAGTGACTAGAGATGCTATCATTTTAACGGGGTCAAATGGACAAATTTCAATTGAAAATACTATTCCTGCTTCAAATGAAAACGCAGGTTTACTAGTAACGAATCCAGGCTCTATTTTGTTAGAAGCTGGTTTCTTTATTAATATTATTTCAAGTTTACCAGATGTAACTTTAGAATTTACAGAGATTGAACAACATCAAATTGTTCTTACTAGTGGAAAATCAGAGATTACTTTGAAAGGTAAGGATGTCGATCAATACCCTCGTCTGCAGGAAATGACAACAGATACTCCATTAACATTAGAAACTAAACTGTTAAAATCAATTATTAATGAAACTGCTTTTGCTGCTAGCCAACAAGAAAGCCGTCCAATCTTAACAGGTGTTCATTTGGTTATCAGTCAAAATAAATACTTTAAGGCTGTTGCGACAGATTCACACCGTATGAGTCAACGCACTTTCCAATTAGAGAAATCGGCTAATAATTTTGATTTGGTTGTTCCAAGTAAATCCCTTCGAGAATTTTCGGCTGTTTTTACAGATGATATTGAAACTGTAGAGGTTTTCTTCTCAGATAGTCAAATGTTATTTAGAAGTGAAAATATCAGCTTCTATACACGTTTGCTTGAAGGAAACTACCCTGATACTGATCGCCTCCTAACTAATCAGTTTGAAACCGAAATTATCTTTAATACAAATGCTTTACGCCATGCTATGGAACGTGCTTATTTAATTTCGAATGCAACTCAGAACGGTACTGTTCGTTTAGAAATTCAAAATGAAACAGTCTCAGCTCATGTAAACTCTCCAGAAGTTGGTAAAGTTAATGAGGAATTGGATACTGTTAGCCTTAAAGGTGATAGTTTAAATATTAGTTTTAATCCAACTTACCTAATTGAATCTTTAAAAGCAGTAAAAAGCGAAACAGTTACGATTCGATTTATTTCTCCAGTACGTCCATTTACTTTGACACCTGGTGAAGATACTGAAGATTTCATACAATTAATAACTCCTGTTCGTACTAACTAA
>orf00003
ATGATGAGTAATATGACTCTATATATAATAGCTAACCCCCATGCTGGTAATAAAAATGCCTCCACTATTGTTGGTCAAATTCAGGAGTTTTATCATACTGAAGATATTTCTGTGTTCTATACAGAACAGAAAGATGATGAAAAAAAACAAGTCATTAATATACTAAGGTCTTTTAAAGAAAGTGATCATCTAATGATTATAGGAGGAGATGGTACCTTATCAAAAGTAATGACTTATCTCCCCAACATATTCCGTGCGCTTATTATCCTGTTGGTTCGGGAAATGATTTTGCCAGAGCTTTGA

测试命令:

perl cds2aa.pl test.fa test.pep.txt

测试结果文件test.pep.txt

>orf00001
MVQYNNNYPQDNKEEAMTENEQLFWNRVLELSRSQIAPAAYEFFVLEARLLKIEHQTAVITLDNIEMKKLFWEQNLGPVILTAGFEIFNAEITANYVSNDLHLQETSFSNYQQSSNEVNTLPIRKIDSNLKEKYTFANFVQGDENRWAVSASIAVADSPGTTYNPLFIWGGPGLGKTHLLNAIGNQVLRDNPNARVLYITAENFINEFVSHIRLDSMEELKEKFRNLDLLLIDDIQSLAKKTLGGTQEEFFNTFNALHTNDKQIVLTSDRNPNQLNDLEERLVTRFSWGLPVNITPPDFETRVAILTNKIQEYPYDFPQDTIEYLAGEFDSNVRELEGALKNISLVADFKHAKTITVDIAAEAIRARKNDGPIVTVIPIEEIQIQVGKFYGVTVKEIKATKRTQDIVLARQVAMYLAREMTDNSLPKIGKEFGGRDHSTVLHAYNKIKNMVAQDDNLRIEIETIKNKIR
>orf00002
MIHFSINKNFFLHALTVTKRAISHKNAIPILSTVKIEVTRDAIILTGSNGQISIENTIPASNENAGLLVTNPGSILLEAGFFINIISSLPDVTLEFTEIEQHQIVLTSGKSEITLKGKDVDQYPRLQEMTTDTPLTLETKLLKSIINETAFAASQQESRPILTGVHLVISQNKYFKAVATDSHRMSQRTFQLEKSANNFDLVVPSKSLREFSAVFTDDIETVEVFFSDSQMLFRSENISFYTRLLEGNYPDTDRLLTNQFETEIIFNTNALRHAMERAYLISNATQNGTVRLEIQNETVSAHVNSPEVGKVNEELDTVSLKGDSLNISFNPTYLIESLKAVKSETVTIRFISPVRPFTLTPGEDTEDFIQLITPVRTN
>orf00003
MMSNMTLYIIANPHAGNKNASTIVGQIQEFYHTEDISVFYTEQKDDEKKQVINILRSFKESDHLMIIGGDGTLSKVMTYLPNIFRALIILLVREMILPEL

Perl count genes

目录 Perl

经常遇到要统计fasta格式基因组的各项指标,如以下test.fa:

>orf00001
ATGGTACAATATAACAATAATTATCCACAAGACAATAAGGAAGAAGCTATGACGGAAAACGAACAACTATTTTGGAATAGAGTACTAGAGCTATCTCGTTCTCAAATAGCACCAGCAGCTTATGAATTTTTTGTTCTAGAGGCTAGACTCCTCAAAATTGAACATCAAACTGCAGTTATTACTTTAGATAACATTGAAATGAAAAAGCTATTCTGGGAACAAAATTTGGGGCCTGTTATCCTAACAGCTGGTTTTGAAATTTTCAATGCTGAAATTACAGCTAACTATGTCTCAAACGATTTACATTTACAAGAAACTAGTTTTTCTAACTACCAGCAATCTAGCAATGAAGTAAATACTTTACCAATTAGAAAAATCGACTCTAATCTTAAAGAGAAATATACTTTTGCTAATTTTGTTCAAGGAGATGAAAATAGATGGGCTGTTTCAGCATCAATTGCTGTAGCTGATAGTCCTGGCACGACTTATAATCCTCTATTTATCTGGGGAGGACCTGGTCTAGGAAAGACGCATCTACTAAATGCTATTGGAAATCAAGTCTTAAGAGATAATCCAAACGCGAGGGTTTTATACATCACTGCTGAGAATTTTATTAATGAATTCGTCAGTCATATTCGTTTAGATTCGATGGAAGAATTAAAAGAAAAGTTTCGCAACTTGGACTTACTCCTGATTGATGATATTCAGTCGCTTGCTAAGAAAACCTTAGGGGGGACCCAAGAGGAGTTCTTCAATACTTTCAATGCTTTACATACAAACGATAAACAAATCGTATTGACCAGTGACCGAAATCCAAATCAATTAAATGATCTAGAAGAACGTCTAGTCACGCGCTTTAGTTGGGGACTCCCAGTAAATATCACACCACCTGATTTTGAAACACGAGTTGCTATTTTAACCAATAAAATTCAAGAATATCCTTATGATTTTCCTCAAGATACCATTGAATACTTAGCAGGAGAATTTGATTCCAACGTACGTGAATTAGAAGGAGCCTTGAAAAATATTAGTCTAGTTGCTGACTTTAAGCATGCTAAAACTATTACAGTAGATATAGCTGCAGAAGCTATCAGAGCACGTAAAAATGACGGTCCTATTGTTACTGTCATTCCTATAGAAGAAATTCAAATACAGGTTGGTAAATTCTATGGCGTAACTGTAAAAGAGATAAAAGCAACTAAAAGAACACAAGATATTGTCCTTGCAAGACAGGTAGCCATGTACTTAGCTCGTGAGATGACAGATAACAGTCTCCCCAAAATAGGTAAAGAATTTGGGGGACGAGATCACTCAACTGTTCTCCACGCTTATAATAAAATAAAAAATATGGTTGCTCAAGATGACAACTTACGAATTGAGATAGAAACTATCAAAAATAAAATCAGGTAG
>orf00002
ATGATTCATTTTTCAATAAATAAAAATTTCTTCTTGCATGCTCTAACGGTAACCAAACGAGCTATTAGTCATAAAAATGCGATTCCAATCCTTTCAACTGTTAAAATAGAAGTGACTAGAGATGCTATCATTTTAACGGGGTCAAATGGACAAATTTCAATTGAAAATACTATTCCTGCTTCAAATGAAAACGCAGGTTTACTAGTAACGAATCCAGGCTCTATTTTGTTAGAAGCTGGTTTCTTTATTAATATTATTTCAAGTTTACCAGATGTAACTTTAGAATTTACAGAGATTGAACAACATCAAATTGTTCTTACTAGTGGAAAATCAGAGATTACTTTGAAAGGTAAGGATGTCGATCAATACCCTCGTCTGCAGGAAATGACAACAGATACTCCATTAACATTAGAAACTAAACTGTTAAAATCAATTATTAATGAAACTGCTTTTGCTGCTAGCCAACAAGAAAGCCGTCCAATCTTAACAGGTGTTCATTTGGTTATCAGTCAAAATAAATACTTTAAGGCTGTTGCGACAGATTCACACCGTATGAGTCAACGCACTTTCCAATTAGAGAAATCGGCTAATAATTTTGATTTGGTTGTTCCAAGTAAATCCCTTCGAGAATTTTCGGCTGTTTTTACAGATGATATTGAAACTGTAGAGGTTTTCTTCTCAGATAGTCAAATGTTATTTAGAAGTGAAAATATCAGCTTCTATACACGTTTGCTTGAAGGAAACTACCCTGATACTGATCGCCTCCTAACTAATCAGTTTGAAACCGAAATTATCTTTAATACAAATGCTTTACGCCATGCTATGGAACGTGCTTATTTAATTTCGAATGCAACTCAGAACGGTACTGTTCGTTTAGAAATTCAAAATGAAACAGTCTCAGCTCATGTAAACTCTCCAGAAGTTGGTAAAGTTAATGAGGAATTGGATACTGTTAGCCTTAAAGGTGATAGTTTAAATATTAGTTTTAATCCAACTTACCTAATTGAATCTTTAAAAGCAGTAAAAAGCGAAACAGTTACGATTCGATTTATTTCTCCAGTACGTCCATTTACTTTGACACCTGGTGAAGATACTGAAGATTTCATACAATTAATAACTCCTGTTCGTACTAACTAA
>orf00003
ATGATGAGTAATATGACTCTATATATAATAGCTAACCCCCATGCTGGTAATAAAAATGCCTCCACTATTGTTGGTCAAATTCAGGAGTTTTATCATACTGAAGATATTTCTGTGTTCTATACAGAACAGAAAGATGATGAAAAAAAACAAGTCATTAATATACTAAGGTCTTTTAAAGAAAGTGATCATCTAATGATTATAGGAGGAGATGGTACCTTATCAAAAGTAATGACTTATCTCCCCAACATATTCCGTGCGCTTATTATCCTGTTGGTTCGGGAAATGATTTTGCCAGAGCTTTGA

可用perl脚本count.pl来实现:

#!/usr/bin/perl
use strict;
use warnings;

#Usage:perl count.pl $ARGV[0] $ARGV[1]

my $total_len=0;
my $total_num=0;
my $total_gap=0;
my $total_gc=0;
my @len_array=();

open IN,">$ARGV[0]",or die "can't open the file $!\n"; #genome file
open OUT,">$ARGV[1]"; #result file

print OUT "ID\tGeneLength(bp)\tGapLength(bp)\tGCContent(%)\n";

$/=">";
<IN>;
while(<IN>){
next if (/^\s+$/);
chomp;
my ($id,$seq)=(split /\n/,$_,2)[0,1];
$seq=~ s/\n//g;
my $len=length ($seq);
push @len_array,$len;
$total_len+=$len;
my $gc+=($seq =~ s/G/G/g+$seq=~ s/C/C/g);
my $GC=($gc/$len)*100;
my $gap+=($seq=~ s/N/N/g);
$total_gap+=$gap;
$total_gc+=$gc;
print OUT "$id\t$len\t$gap\t";
printf OUT "%.2f\n","$GC";
$total_num++;
}

my $avg_len=($total_len/$total_num);
my $total_GC=($total_gc/$total_len)*100;
my @sort_len=sort {$a <=> $b} @len_array;

print OUT "\nTotal Stat:\n";
print OUT "Total Scaffolds Number (#):$total_num\n";
print OUT "Total Length (bp):$total_len\n";
print OUT "Gap Length (N)(bp):$total_gap\n";
printf OUT "Average Length (bp):%d\n","$avg_len";
print OUT "Minimum Length (bp):$sort_len[0]\n";
print OUT "Maximum Length (bp):$sort_len[-1]\n";
printf OUT "GC content(%%):%.2f\n","$total_GC";

close IN;
close OUT;

命令终端运行以下命令:

perl count.pl test.fa test.result.txt

便可得到以下结果test.result.txt:

ID GeneLength(bp) GapLength(bp) GCContent(%)
orf00001 1410 0 35.18
orf00002 1137 0 33.42
orf00003 303 0 33.66

Total Stat:
Total Scaffolds Number (#):3
Total Length (bp):2850
Gap Length (N)(bp):0
Average Length (bp):950
Minimum Length (bp):303
Maximum Length (bp):1410
GC content(%):34.32