RepeatMasker RepeatScout

目录 Bioinfo
#build frequency table
build_lmer_table -sequence input_genome_sequence.fas -freq output_lmer.frequency
#create fasta file containing all kinds of repeats
RepeatScout -sequence input_genome_sequence.fas -output output_repeats.fas  -freq output_lmer.frequency
#filter out short (<50bp) sequences
filter-stage-1.prl output_repeats.fas > output_repeats.fas.filtered_1
#run RepeatMasker on your genome of interest using filtered RepeatScout library
RepeatMasker  input_genome_sequence.fas -lib output_repeats.fas.filtered_1
#filtering putative repeats by copy number
cat output_repeats.fas.filtered_1  | filter-stage-2.prl --cat=input_genome_sequence.fas.out > output_repeats.fas.filtered_2

Biosoft

目录 Bioinfo

基因组拼接:
Phrap,Soapdenovo,mira,velvet
转录组拼接:
Cufflinks,Trinity,Oases;
比对软件:
Blast,ABblast,soap,maq,bowtie,bwa
基因结构注释:
Genscan,snap,GlimmerHMM,Augustus,Fgenesh,PASA
基因功能注释:
Interproscan,KEGG,WEGO
重复序列注释:
LTR_FINDER,LTR_STRUC,Piler,Recon,RepeatModeler,RepeatSout,Trf,MISA,RepeatMasker
非编码RNA注释:
Snoscan,tRNAscan-SE,snoGPS

Hash sort

目录 Perl

perl语言中hash排序可以按照key和value排序,各自又可分为按照数字大小和ASCII排序。如果$a和$b颠倒,则为倒序排列。

key排序
数字排序
foreach $key (sort {$a <=> $b} keys %hash)
ASCII排序
foreach $key (sort {$a cmp $b} keys %hash)
value排序
数字排序
foreach $key ( sort {$hash{$a} <=> $hash{$b}} keys %hash )
ASCII排序
foreach $key ( sort {$hash{$a} cmp $hash{$b}} keys %hash )

ERROR – /BIN/RM: ARGUMENT LIST TOO LONG !

目录 Linux

When trying to delete the files of a folder (using rm) with lots of contents in it , you might get this error :

/bin/rm : Argument list too long

The traditional rm command will not be able to delete too many files in a directory.
To get around this, use the command given below. Please note that the below command will delete all the files in the current directory in which you are logged into.

find . -name '*' | xargs rm

or:

find ./ -name '*' -print0 | xargs -0 -n 10 rm

Mac install mysql

目录 MySQL

由于官网下载安装mysql会出现各种问题,所以推荐使用brew大法安装mysql。具体命令如下:

brew update                #更新brew
brew install mysql         #安装mysql
mysql.server start         #启用mysql
mysql_secure_installation  #配置脚本安装,按照提示一直yes,设置密码
mysql -uroot -p            #输入密码登陆
mysql.server stop          #停止mysql

Homebrew MacPorts Fink

目录 MacOS

Mac电脑软件包管理工具主要有Homebrew、MacPorts、Fink。
1.Homebrew安装及常见使用命令:

sudo chown -R `whoami` /usr/local
/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
echo 'export PATH="/usr/local/bin:$PATH"' >> ~/.bash_profile
brew update                   #更新Homebrew
brew upgrade                  #升级软件
brew search formula           #查找
brew install formula          #安装
brew uninstall formula        #卸载
brew list                     #列出已安装的软件
brew home formula             #用浏览器打开
brew info formula             #显示软件内容信息
brew deps formula             #显示软件的依赖
brew -h                       #帮助

2.MacPorts安装及常见使用命令:

sudo port -v selfupdate       #更新
port search formula           #搜索
port install formula          #安装
port uninstall formula        #卸载
port outdated                 #查看有无更新的软件以及版本
port upgrade outdated         #升级可以更新的软件
port list                     #列出所有的可用软件
port info formula             #查看软件的详细信息
port deps formula             #查看软件的依赖
port clean --all formula      #删除软件安装过程中所产生的一些临时文件
port dependents formula       #查看依赖

3.Fink安装及常见使用命令:

fink selfupdate               #升级
fink install formula          #安装
fink reinstall formula        #重新安装
fink remove formula           #卸载软件,如果想把依赖包也一起卸载,加-r
fink update-all               #更新所有的软件
fink list                     #查看可用软件
fink info formula             #查看相关软件的信息
fink show-deps formula        #显示软件的依赖关系

总结
Homebrew尽量使用系统自带的库,所有和系统依赖紧密,编译时间短。MacPorts则不用系统自带的库,所有的安装包及其依赖都安装,所以编译时间长,占用空间大,卸载比较麻烦。Fink安装包比较少,更新慢。综上,现在选择Homebrew作为软件包管理工具。

Vim set

目录 MacOS

Mac电脑有自带vim,但是默认情况下没有配置颜色、行号等,需要自己配置:

/usr/share/vim/vimrc #只读,无法修改
vim ~/.vimrc         #自己创建.vimrc文件

具体配置如下:

syntax on
set laststatus=2
set tabstop=2
set softtabstop=2
set shiftwidth=2
set expandtab
set smarttab
set autoindent
set smartindent
set number
set ruler
set backupdir=/tmp
set directory=/tmp
set ignorecase
set hls
set helplang=cn
set foldmethod=syntax

最后记得要想立即生效,输入以下命令:

source ~/.vimrc

environment variables

目录 MacOS

Mac电脑有3个文件可以配置环境变量:

/etc/bashrc      #只读,全局(公有)配置
/etc/profile     #只读,全局(公有)配置
~/.bash_profile  #用户级环境变量

一般情况下,只需要通过vim命令修改~/.bash_profile来配置环境变量(如配色,alias,PATH环境变量):

vim ~/.bash_profile
#color
export CLICOLOR=1
export LSCOLORS=GxFxCxDxBxegedabagaced
#\h:\W \u\$
export PS1='\[\033[01;33m\]\u@\h\[\033[01;31m\] \W\$\[\033[00m\] '
#/usr/local/bin/
export PATH=/usr/local/bin:$PATH
..

最后要立即生效输入以下命令:

source ~/.bash_profile

array

目录 Perl

转自脚本之家
1、去除一个数组中的重复元素:
使用grep函数代码片段:

my @array = ( 'a', 'b', 'c', 'a', 'd', 1, 2, 5, 1, 5 );
my %count;
my @uniq_times = grep { ++$count{ $_ } < 2; } @array;

使用转换hash代码片段:

my @array = ( 'a', 'b', 'c', 'a', 'd', 1, 2, 5, 1, 5 );
my %saw;
@saw{ @array } = ( );
my @uniq_array = sort keys %saw;

2、合并两个array:

push @array1, @array2;

3、快速查找最大值,不知道的程序猿们,这样搞:

my @nums = 0 .. 1000;
my $max = $nums[0];
foreach (@nums) {
    $max = $_ if $_ > $max;
}

知道的这样搞:

use List::Util qw(max);
my $max_num = max( 0 .. 1000 );

知道的他们还这样搞:

use List::Util qw(maxstr);
my $max_str = maxstr ( qw( Fido Spot Rover ) );

字符串比较玩弄于掌中。还有sum:

use List::Util qw(sum);
my $sum = sum ( 1 .. 1000 );

4、列表归并
数字求和,也可以用List::Util中的reduce:

use List::Util qw(reduce);
my $sum = reduce { $a + $b } 1 .. 1000;

与sort类似,reduce也是用code block作为参数,不过运行机制稍微不同。每次迭代,先从参数列表取出前面两个元素,分别设置为别名$a和$b,这样参数列表的长度就会缩短为两个元素。然后reduce把语句块返回的计算结果再压回到参数列表的头部。如此往复,直到最后列表里只剩下一个元素,也就是迭代的计算结果$sum。
好了,可以这样了:

my $product = reduce { $a * $b } 1 .. 1000;

5、判断是否有元素匹配
纯粹用Perl实现,找到列表中第一个符合某条件的元素,比找出所有符合条件的要麻烦一些。下面的例子,判断是否有大于1000的元素:

my $found_a_match = grep { $_ > 1000 } @list;

注意:如果@list有一亿个元素,而要找的就是1001?grep仍然还会循环一亿次,当然你可以向下面自己控制下:

my $found_a_match = 0;
foreach my $elem (@list) {
    $found_a_match = $elem if $elem > 1000;
    last if $found_a_match;
}

还是那句话,不简单~~~List::Util有现成的东西:

use List::Util qw(first);
my $found_a_match = fist { $_ > 1000 } @list;

在List::MoreUtils模块中,也提供很多的实用函数:

my $found_a_match = any { $_ > 1000 } @list;
my $all_greater = all { $_ > 1000 } @list;
my $none_greater = none { $_ > 1000 } @list;
my $all_greater = notall { $_ % 2 } @list;

6、一次遍历多个列表
一般我们同时遍历多个业务相关的列表时,往往用数组下标遍历:

my @a = ( ... );
my @b = ( ... );
my @c;
foreach my $i ( 0 .. $#list ) {
    my ( $a, $b ) = ( $a[$i], $b[$i] );
    push @c, $a + $b;
}

看下面这个,你的感觉是?

use List::MoreUtils qw(pairwise);
my @c = pairwise { $a + $b } @a, @b;

pairwise只适合两个列表的同步计算,三个后用each_array:

use List::MoreUtils qw(each_array);
my $ea = each_array( @a, @b, @c );
my @d;
while ( my ( $a, $b, $c ) = $ea->() ) {
    push @d, $a+$b+$c;
}

虽然还是有点烦,不过也还好了。
7、数组合并
合并多个数组的操作当然你可以自己写,但终究不如MoreUtils的mesh方便:

use List::MoreUtils qw(mesh);
my @odds = qw/ 1 3 5 7 9/;
my @evens= qw/ 2 4 6 8 0/;
my @nums = mesh @odds, @evens; # print: 1 2 3 4 ...

filter repeat element from array

目录 Perl

转自Perl Maven
在本节Perl教程中我们会看到如何使数组仅包含不同的值。
Perl 5没有过滤数组重复值的内建函数,但还是有很多解决方案。
List::MoreUtils
对于你的问题,可能最简单的方式是使用List::MoreUtils模块的uniq函数。

use List::MoreUtils qw(uniq);
my @words = qw(foo bar baz foo zorg baz);
my @unique_words = uniq @words;

完整示例:

use strict;
use warnings;
use 5.010;
use List::MoreUtils qw(uniq);
use Data::Dumper qw(Dumper);
my @words = qw(foo bar baz foo zorg baz);
my @unique_words = uniq @words;
say Dumper \@unique_words;

结果:

$VAR1 = [
  'foo',
  'bar',
  'baz',
  'zorg'
];

另外,该模块还提供了uniq的别名函数——distinct。
你得利用CPAN先安装这个模块后才能使用。
自制 uniq
如果不能安装上面的模块,或者你认为加载模块的开销太大,那么可以使用一个简单表达式也能达到同样的效果:

my @unique = do { my %seen; grep { !$seen{$_}++ } @data };

这对不熟悉的人会很奇怪,所以建议自定义uniq函数,然后在代码的其他部分使用。

use strict;
use warnings;
use 5.010;
use Data::Dumper qw(Dumper);
my @words = qw(foo bar baz foo zorg baz);
my @unique_words = uniq( @words );
say Dumper \@unique_words;
sub uniq {
  my %seen;
  return grep { !$seen{$_}++ } @_;
}

解释自制的uniq
举例之后当然不能置之不理,下面就来解释一下。先从早点的版本开始:

my @unique;
my %seen;
foreach my $value (@words) {
  if (! $seen{$value}) {
    push @unique, $value;
    $seen{$value} = 1;
  }
}

这里用普通的foreach循环一个个处理原数组中的值,处理过程中使用了辅助哈希表%seen,因为哈希表的键是唯一的。
开始的时候哈希表是空的,所以当遇到第一个”foo”的时候$seen{“foo”}并不存在,它的值是undef,也就是Perl中的false,表示之前没有见过这个值,然后把这个值添加到@uniq数组的最后。
这里把$seen{“foo”}设为1,其实任何为true的值都可以。
下一次遇到相同的字符串时,它已经作为%seen哈希表的键且对应的值是true,所以if条件判断会失败,也就不会push重复的值到数组。
简化自制的 unique 函数
首先把赋值语句$seen{$value} = 1;替换成自增操作符$seen{$value}++。这不会改变之前的方案——任何正数都被看作TRUE,但是会把设置”seen标志”的语句放在if条件判断中。区别开使用后缀自增(而不是前缀自增)是非常重要的,因为它会在布尔表达式计算完之后再自增。当我们第一次遇到某个值时(if的布尔)表达式为TRUE,之后再遇到同一个值时为FALSE。

my @unique;
my %seen;
foreach my $value (@data) {
  if (! $seen{$value}++ ) {
    push @unique, $value;
  }
}

这已经缩短了代码,但是我们还有更好的方案。
使用grep过滤重复值
Perl的grep函数是Unix的grep命令的一般化形式。
它基本上是一个过滤器。你可以在右边传入一个数组,在代码块中传入一个表达式。grep函数会一个一个的提取数组中的值到$_(Perl的默认标量)中,然后执行代码块。如果代码块求值为TRUE,对应的值会通过过滤。如果代码块求值为FALSE,当前值会被过滤掉。
于是我们得到这样的表达式:

my %seen;
my @unique = grep { !$seen{$_}++ } @words;

包裹在’do’或’sub’中
最后要做的是把上面两个语句包裹在do代码块

my @unique = do { my %seen; grep { !$seen{$_}++ } @words };

或者放在具名函数中:

sub uniq {
my %seen;
return grep { !$seen{$_}++ } @_;
}

另一个自制uniq
如果对元素的顺序没有要求的话,可以使用Prakash Kailasa建议的更短的uniq实现版本(需要perl 5.14或更高):
内联代码:

my @unique = keys { map { $_ => 1 } @data };

或者在函数里:

my @unique = uniq(@data);
sub uniq { keys { map { $_ => 1 } @_ } };

分开讲解一下:
map的语法和grep相似:一个代码块以及一个数组。它会遍历数组的所有元素,执行代码块并把结果传递给左边。
在我们的例子里,数组中的每个元素处理之后都会和数字1一起被传递。不要忘记=>(也称胖逗号)就是个逗号。假设@data是(‘a’, ‘b’, ‘a’),那表达式会返回(‘a’, 1, ‘b’, 1, ‘a’, 1)。

map { $_ => 1 } @data

如果把这个表达式赋值给一个哈希表,那么原来的数据会作为键,数字1会作为值。尝试一下:

use strict;
use warnings;
use Data::Dumper;
my @data = qw(a b a);
my %h = map { $_ => 1 } @data;
print Dumper \%h;

输出:

$VAR1 = {
  'a' => 1,
  'b' => 1
};

如果把它包裹在花括号里而不是赋值的话,我们会得到一个匿名哈希表的引用。

{ map { $_ => 1 } @data }

Let’s see it in action: 看一下实际例子:

use strict;
use warnings;
use Data::Dumper;
my @data = qw(a b a);
my $hr = { map { $_ => 1 } @data };
print Dumper $hr;

除了哈希表的dump顺序可能改变以外,这和之前的输出结果一致。
最后,从perl 5.14开始我们可以对哈希表引用调用keys函数,所以可以这样写:

my @unique = keys { map { $_ => 1 } @data };

这样可以从@data中提取唯一值。

ssh sftp ftp

目录 MacOS

由于学习需要,经常要登录到服务器上操作,在window系统下,有许多好用的工具,如SSH Secure Shell Client等,而MacOS系统也有许多客户端,如iterm2,File Zilla,ZOL,Shuttle等。下面介绍一下MacOS终端自带的SSH,是标准的OpenSSH client。基本用法如下:

ssh user@host   #输入密码登录服务器

如果不想每次登录都输入user和host,可用alias来配置~/.ssh/config文件:

Host alias-name
    HostName ip_address
    Port 22
    User user

下次登录时直接输入以下命令即可登录:

ssh alias-name  #输入密码登录服务器

如果要本地电脑与服务器之间传输文件,则可用scp命令:

scp <local file> <remote user>@<remote machine>:<remote path>    #本地文件传到服务器上
scp <remote user>@<remote machine>:<remote file> <local path>    #服务器文件下载到本地
scp -r <local file> <remote user>@<remote machine>:<remote path> #本地文件夹传到服务器上
scp -r <remote user>@<remote machine>:<remote file> <local path> #服务器文件夹下载到本地

也可以通过sftp的put与get命令实现文件的上传与下载,具体命令如下;

sftp user@host  #输入密码登录服务器
sftp> help      #帮助信息
Available commands:
bye                                Quit sftp
cd path                            Change remote directory to 'path'
chgrp grp path                     Change group of file 'path' to 'grp'
chmod mode path                    Change permissions of file 'path' to 'mode'
chown own path                     Change owner of file 'path' to 'own'
df [-hi] [path]                    Display statistics for current directory or
                                   filesystem containing 'path'
exit                               Quit sftp
get [-afPpRr] remote [local]       Download file
reget [-fPpRr] remote [local]      Resume download file
reput [-fPpRr] [local] remote      Resume upload file
help                               Display this help text
lcd path                           Change local directory to 'path'
lls [ls-options [path]]            Display local directory listing
lmkdir path                        Create local directory
ln [-s] oldpath newpath            Link remote file (-s for symlink)
lpwd                               Print local working directory
ls [-1afhlnrSt] [path]             Display remote directory listing
lumask umask                       Set local umask to 'umask'
mkdir path                         Create remote directory
progress                           Toggle display of progress meter
put [-afPpRr] local [remote]       Upload file
pwd                                Display remote working directory
quit                               Quit sftp
rename oldpath newpath             Rename remote file
rm path                            Delete remote file
rmdir path                         Remove remote directory
symlink oldpath newpath            Symlink remote file
version                            Show SFTP version
!command                           Execute 'command' in local shell
!                                  Escape to local shell
?                                  Synonym for help
sftp>put /path/filename(本地主机) /path/filename(远端主机) #上传文件
sftp>get /path/filename(远端主机) /path/filename(本地主机) #下载文件

如果支持ftp的服务器,也可用用ftp命令登录,通过get、mget与put、mput实现文件的下载与上传:

ftp example.com
Connected to example.com.
220 Welcome to www.net.cn FTP service.
Name (example.com:$): user
331 Please specify the password.
Password:
230 Login successful.
Remote system type is UNIX.
Using binary mode to transfer files.
ftp> help
Commands may be abbreviated.  Commands are:
!       chmod       fget        less        mlst        open        pwd     rhelp       struct      verbose
$       close       form        lpage       mode        page        quit        rmdir       sunique     xferbuf
account     cr      ftp     lpwd        modtime     passive     quote       rstatus     system      ?
append      debug       gate        ls      more        pdir        rate        runique     tenex
ascii       delete      get     macdef      mput        pls     rcvbuf      send        throttle
bell        dir     glob        mdelete     mreget      pmlsd       recv        sendport    trace
binary      disconnect  hash        mdir        msend       preserve    reget       set     type
bye     edit        help        mget        newer       progress    remopts     site        umask
case        epsv4       idle        mkdir       nlist       prompt      rename      size        unset
cd      exit        image       mls     nmap        proxy       reset       sndbuf      usage
cdup        features    lcd     mlsd        ntrans      put     restart     status      user
ftp

Perl Translation protein

目录 Perl

将CDS翻译成蛋白的脚本cds2aa.pl

#!/usr/bin/perl
use strict;

if(scalar @ARGV==0){
    die "This program is used to trans cds to pep
    perl $0 <cds file> <pep>\n"
;
}

open IN,"<$ARGV[0]";
open OUT,">$ARGV[1]";

$/=">";
<IN>;

while (<IN>){
    chomp;
    next if (/^\s+$/);
    my ($id,$dna)=(split /\n/,$_,2)[0,1];
    $dna=~ s/\n//g;
    my $protein="";
    for (my $i=0;$i< (length($dna)-2);$i+=3){
        $protein.= &codon2aa (substr ($dna,$i,3));
    }
    print OUT ">$id\n";
    print OUT "$protein\n";
}
close IN;
$/="\n";

sub codon2aa {
    my ($codon)=@_;
    $codon=uc $codon;
    my(%genetic_code)=(
    'TCA'=>'S',#Serine
    'TCC'=>'S',#Serine
    'TCG'=>'S',#Serine
    'TCT'=>'S',#Serine
   
    'TTC'=>'F',#Phenylalanine;
    'TTT'=>'F',#Phenylalanine;
    'TTA'=>'L',#Leucine
    'TTG'=>'L',#Leucine
   
    'TAC'=>'Y',#Tyrosine
    'TAT'=>'Y',#Tyrosine
    'TAA'=>'', #Stop
    'TAG'=>'', #Stop
   
    'TGC'=>'C',#Cysteine
    'TGT'=>'C',#Cysteine
    'TGA'=>'', #Stop
    'TGG'=>'W',#Tryptophan
   
    'CTA'=>'L',#Leucine
    'CTC'=>'L',#Leucine
    'CTG'=>'L',#Leucine
    'CTT'=>'L',#Leucine
   
    'CCA'=>'P',#Proline
    'CCC'=>'P',#Proline
    'CCG'=>'P',#Proline
    'CCT'=>'P',#Proline
   
    'CAC'=>'H',#Histidine
    'CAT'=>'H',#Histidine
    'CAA'=>'Q',#Glutamine
    'CAG'=>'Q',#Glutamine
   
    'CGA'=>'R',#Arginine
    'CGC'=>'R',#Arginine
    'CGG'=>'R',#Arginine
    'CGT'=>'R',#Arginine
   
    'ATA'=>'I',#Isoleucine
    'ATC'=>'I',#Isoleucine
    'ATT'=>'I',#Isoleucine
    'ATG'=>'M',#Methionine
   
    'ACA'=>'T',#Threonine
    'ACC'=>'T',#Threonine
    'ACG'=>'T',#Threonine
    'ACT'=>'T',#Threonine
   
    'AAC'=>'N',#Asparagine
    'AAT'=>'N',#Asparagine
    'AAA'=>'K',#Lysine
    'AAG'=>'K',#Lysine
   
    'AGC'=>'S',#Serine
    'AGT'=>'S',#Serine
    'AGA'=>'R',#Arginine
    'AGG'=>'R',#Arginine
   
    'GTA'=>'V',#Valine
    'GTC'=>'V',#Valine
    'GTG'=>'V',#Valine
    'GTT'=>'V',#Valine
   
    'GCA'=>'A',#Alanine
    'GCC'=>'A',#Alanine
    'GCG'=>'A',#Alanine
    'GCT'=>'A',#Alanine
   
    'GAC'=>'D',#Aspartic Acid
    'GAT'=>'D',#Aspartic Acid
    'GAA'=>'E',#Glutamine Acid
    'GAG'=>'E',#Glutamine Acid
   
    'GGA'=>'G',#Glycine
    'GGC'=>'G',#Glycine
    'GGG'=>'G',#Glycine
    'GGT'=>'G',#Glycine
    );
    if (exists $genetic_code{$codon}) {
        return "$genetic_code{$codon}";
    }else{
        return "X";
    }
}

测试的CDS文件test.fa:

>orf00001
ATGGTACAATATAACAATAATTATCCACAAGACAATAAGGAAGAAGCTATGACGGAAAACGAACAACTATTTTGGAATAGAGTACTAGAGCTATCTCGTTCTCAAATAGCACCAGCAGCTTATGAATTTTTTGTTCTAGAGGCTAGACTCCTCAAAATTGAACATCAAACTGCAGTTATTACTTTAGATAACATTGAAATGAAAAAGCTATTCTGGGAACAAAATTTGGGGCCTGTTATCCTAACAGCTGGTTTTGAAATTTTCAATGCTGAAATTACAGCTAACTATGTCTCAAACGATTTACATTTACAAGAAACTAGTTTTTCTAACTACCAGCAATCTAGCAATGAAGTAAATACTTTACCAATTAGAAAAATCGACTCTAATCTTAAAGAGAAATATACTTTTGCTAATTTTGTTCAAGGAGATGAAAATAGATGGGCTGTTTCAGCATCAATTGCTGTAGCTGATAGTCCTGGCACGACTTATAATCCTCTATTTATCTGGGGAGGACCTGGTCTAGGAAAGACGCATCTACTAAATGCTATTGGAAATCAAGTCTTAAGAGATAATCCAAACGCGAGGGTTTTATACATCACTGCTGAGAATTTTATTAATGAATTCGTCAGTCATATTCGTTTAGATTCGATGGAAGAATTAAAAGAAAAGTTTCGCAACTTGGACTTACTCCTGATTGATGATATTCAGTCGCTTGCTAAGAAAACCTTAGGGGGGACCCAAGAGGAGTTCTTCAATACTTTCAATGCTTTACATACAAACGATAAACAAATCGTATTGACCAGTGACCGAAATCCAAATCAATTAAATGATCTAGAAGAACGTCTAGTCACGCGCTTTAGTTGGGGACTCCCAGTAAATATCACACCACCTGATTTTGAAACACGAGTTGCTATTTTAACCAATAAAATTCAAGAATATCCTTATGATTTTCCTCAAGATACCATTGAATACTTAGCAGGAGAATTTGATTCCAACGTACGTGAATTAGAAGGAGCCTTGAAAAATATTAGTCTAGTTGCTGACTTTAAGCATGCTAAAACTATTACAGTAGATATAGCTGCAGAAGCTATCAGAGCACGTAAAAATGACGGTCCTATTGTTACTGTCATTCCTATAGAAGAAATTCAAATACAGGTTGGTAAATTCTATGGCGTAACTGTAAAAGAGATAAAAGCAACTAAAAGAACACAAGATATTGTCCTTGCAAGACAGGTAGCCATGTACTTAGCTCGTGAGATGACAGATAACAGTCTCCCCAAAATAGGTAAAGAATTTGGGGGACGAGATCACTCAACTGTTCTCCACGCTTATAATAAAATAAAAAATATGGTTGCTCAAGATGACAACTTACGAATTGAGATAGAAACTATCAAAAATAAAATCAGGTAG
>orf00002
ATGATTCATTTTTCAATAAATAAAAATTTCTTCTTGCATGCTCTAACGGTAACCAAACGAGCTATTAGTCATAAAAATGCGATTCCAATCCTTTCAACTGTTAAAATAGAAGTGACTAGAGATGCTATCATTTTAACGGGGTCAAATGGACAAATTTCAATTGAAAATACTATTCCTGCTTCAAATGAAAACGCAGGTTTACTAGTAACGAATCCAGGCTCTATTTTGTTAGAAGCTGGTTTCTTTATTAATATTATTTCAAGTTTACCAGATGTAACTTTAGAATTTACAGAGATTGAACAACATCAAATTGTTCTTACTAGTGGAAAATCAGAGATTACTTTGAAAGGTAAGGATGTCGATCAATACCCTCGTCTGCAGGAAATGACAACAGATACTCCATTAACATTAGAAACTAAACTGTTAAAATCAATTATTAATGAAACTGCTTTTGCTGCTAGCCAACAAGAAAGCCGTCCAATCTTAACAGGTGTTCATTTGGTTATCAGTCAAAATAAATACTTTAAGGCTGTTGCGACAGATTCACACCGTATGAGTCAACGCACTTTCCAATTAGAGAAATCGGCTAATAATTTTGATTTGGTTGTTCCAAGTAAATCCCTTCGAGAATTTTCGGCTGTTTTTACAGATGATATTGAAACTGTAGAGGTTTTCTTCTCAGATAGTCAAATGTTATTTAGAAGTGAAAATATCAGCTTCTATACACGTTTGCTTGAAGGAAACTACCCTGATACTGATCGCCTCCTAACTAATCAGTTTGAAACCGAAATTATCTTTAATACAAATGCTTTACGCCATGCTATGGAACGTGCTTATTTAATTTCGAATGCAACTCAGAACGGTACTGTTCGTTTAGAAATTCAAAATGAAACAGTCTCAGCTCATGTAAACTCTCCAGAAGTTGGTAAAGTTAATGAGGAATTGGATACTGTTAGCCTTAAAGGTGATAGTTTAAATATTAGTTTTAATCCAACTTACCTAATTGAATCTTTAAAAGCAGTAAAAAGCGAAACAGTTACGATTCGATTTATTTCTCCAGTACGTCCATTTACTTTGACACCTGGTGAAGATACTGAAGATTTCATACAATTAATAACTCCTGTTCGTACTAACTAA
>orf00003
ATGATGAGTAATATGACTCTATATATAATAGCTAACCCCCATGCTGGTAATAAAAATGCCTCCACTATTGTTGGTCAAATTCAGGAGTTTTATCATACTGAAGATATTTCTGTGTTCTATACAGAACAGAAAGATGATGAAAAAAAACAAGTCATTAATATACTAAGGTCTTTTAAAGAAAGTGATCATCTAATGATTATAGGAGGAGATGGTACCTTATCAAAAGTAATGACTTATCTCCCCAACATATTCCGTGCGCTTATTATCCTGTTGGTTCGGGAAATGATTTTGCCAGAGCTTTGA

测试命令:

perl cds2aa.pl test.fa test.pep.txt

测试结果文件test.pep.txt

>orf00001
MVQYNNNYPQDNKEEAMTENEQLFWNRVLELSRSQIAPAAYEFFVLEARLLKIEHQTAVITLDNIEMKKLFWEQNLGPVILTAGFEIFNAEITANYVSNDLHLQETSFSNYQQSSNEVNTLPIRKIDSNLKEKYTFANFVQGDENRWAVSASIAVADSPGTTYNPLFIWGGPGLGKTHLLNAIGNQVLRDNPNARVLYITAENFINEFVSHIRLDSMEELKEKFRNLDLLLIDDIQSLAKKTLGGTQEEFFNTFNALHTNDKQIVLTSDRNPNQLNDLEERLVTRFSWGLPVNITPPDFETRVAILTNKIQEYPYDFPQDTIEYLAGEFDSNVRELEGALKNISLVADFKHAKTITVDIAAEAIRARKNDGPIVTVIPIEEIQIQVGKFYGVTVKEIKATKRTQDIVLARQVAMYLAREMTDNSLPKIGKEFGGRDHSTVLHAYNKIKNMVAQDDNLRIEIETIKNKIR
>orf00002
MIHFSINKNFFLHALTVTKRAISHKNAIPILSTVKIEVTRDAIILTGSNGQISIENTIPASNENAGLLVTNPGSILLEAGFFINIISSLPDVTLEFTEIEQHQIVLTSGKSEITLKGKDVDQYPRLQEMTTDTPLTLETKLLKSIINETAFAASQQESRPILTGVHLVISQNKYFKAVATDSHRMSQRTFQLEKSANNFDLVVPSKSLREFSAVFTDDIETVEVFFSDSQMLFRSENISFYTRLLEGNYPDTDRLLTNQFETEIIFNTNALRHAMERAYLISNATQNGTVRLEIQNETVSAHVNSPEVGKVNEELDTVSLKGDSLNISFNPTYLIESLKAVKSETVTIRFISPVRPFTLTPGEDTEDFIQLITPVRTN
>orf00003
MMSNMTLYIIANPHAGNKNASTIVGQIQEFYHTEDISVFYTEQKDDEKKQVINILRSFKESDHLMIIGGDGTLSKVMTYLPNIFRALIILLVREMILPEL

Perl count genes

目录 Perl

经常遇到要统计fasta格式基因组的各项指标,如以下test.fa:

>orf00001
ATGGTACAATATAACAATAATTATCCACAAGACAATAAGGAAGAAGCTATGACGGAAAACGAACAACTATTTTGGAATAGAGTACTAGAGCTATCTCGTTCTCAAATAGCACCAGCAGCTTATGAATTTTTTGTTCTAGAGGCTAGACTCCTCAAAATTGAACATCAAACTGCAGTTATTACTTTAGATAACATTGAAATGAAAAAGCTATTCTGGGAACAAAATTTGGGGCCTGTTATCCTAACAGCTGGTTTTGAAATTTTCAATGCTGAAATTACAGCTAACTATGTCTCAAACGATTTACATTTACAAGAAACTAGTTTTTCTAACTACCAGCAATCTAGCAATGAAGTAAATACTTTACCAATTAGAAAAATCGACTCTAATCTTAAAGAGAAATATACTTTTGCTAATTTTGTTCAAGGAGATGAAAATAGATGGGCTGTTTCAGCATCAATTGCTGTAGCTGATAGTCCTGGCACGACTTATAATCCTCTATTTATCTGGGGAGGACCTGGTCTAGGAAAGACGCATCTACTAAATGCTATTGGAAATCAAGTCTTAAGAGATAATCCAAACGCGAGGGTTTTATACATCACTGCTGAGAATTTTATTAATGAATTCGTCAGTCATATTCGTTTAGATTCGATGGAAGAATTAAAAGAAAAGTTTCGCAACTTGGACTTACTCCTGATTGATGATATTCAGTCGCTTGCTAAGAAAACCTTAGGGGGGACCCAAGAGGAGTTCTTCAATACTTTCAATGCTTTACATACAAACGATAAACAAATCGTATTGACCAGTGACCGAAATCCAAATCAATTAAATGATCTAGAAGAACGTCTAGTCACGCGCTTTAGTTGGGGACTCCCAGTAAATATCACACCACCTGATTTTGAAACACGAGTTGCTATTTTAACCAATAAAATTCAAGAATATCCTTATGATTTTCCTCAAGATACCATTGAATACTTAGCAGGAGAATTTGATTCCAACGTACGTGAATTAGAAGGAGCCTTGAAAAATATTAGTCTAGTTGCTGACTTTAAGCATGCTAAAACTATTACAGTAGATATAGCTGCAGAAGCTATCAGAGCACGTAAAAATGACGGTCCTATTGTTACTGTCATTCCTATAGAAGAAATTCAAATACAGGTTGGTAAATTCTATGGCGTAACTGTAAAAGAGATAAAAGCAACTAAAAGAACACAAGATATTGTCCTTGCAAGACAGGTAGCCATGTACTTAGCTCGTGAGATGACAGATAACAGTCTCCCCAAAATAGGTAAAGAATTTGGGGGACGAGATCACTCAACTGTTCTCCACGCTTATAATAAAATAAAAAATATGGTTGCTCAAGATGACAACTTACGAATTGAGATAGAAACTATCAAAAATAAAATCAGGTAG
>orf00002
ATGATTCATTTTTCAATAAATAAAAATTTCTTCTTGCATGCTCTAACGGTAACCAAACGAGCTATTAGTCATAAAAATGCGATTCCAATCCTTTCAACTGTTAAAATAGAAGTGACTAGAGATGCTATCATTTTAACGGGGTCAAATGGACAAATTTCAATTGAAAATACTATTCCTGCTTCAAATGAAAACGCAGGTTTACTAGTAACGAATCCAGGCTCTATTTTGTTAGAAGCTGGTTTCTTTATTAATATTATTTCAAGTTTACCAGATGTAACTTTAGAATTTACAGAGATTGAACAACATCAAATTGTTCTTACTAGTGGAAAATCAGAGATTACTTTGAAAGGTAAGGATGTCGATCAATACCCTCGTCTGCAGGAAATGACAACAGATACTCCATTAACATTAGAAACTAAACTGTTAAAATCAATTATTAATGAAACTGCTTTTGCTGCTAGCCAACAAGAAAGCCGTCCAATCTTAACAGGTGTTCATTTGGTTATCAGTCAAAATAAATACTTTAAGGCTGTTGCGACAGATTCACACCGTATGAGTCAACGCACTTTCCAATTAGAGAAATCGGCTAATAATTTTGATTTGGTTGTTCCAAGTAAATCCCTTCGAGAATTTTCGGCTGTTTTTACAGATGATATTGAAACTGTAGAGGTTTTCTTCTCAGATAGTCAAATGTTATTTAGAAGTGAAAATATCAGCTTCTATACACGTTTGCTTGAAGGAAACTACCCTGATACTGATCGCCTCCTAACTAATCAGTTTGAAACCGAAATTATCTTTAATACAAATGCTTTACGCCATGCTATGGAACGTGCTTATTTAATTTCGAATGCAACTCAGAACGGTACTGTTCGTTTAGAAATTCAAAATGAAACAGTCTCAGCTCATGTAAACTCTCCAGAAGTTGGTAAAGTTAATGAGGAATTGGATACTGTTAGCCTTAAAGGTGATAGTTTAAATATTAGTTTTAATCCAACTTACCTAATTGAATCTTTAAAAGCAGTAAAAAGCGAAACAGTTACGATTCGATTTATTTCTCCAGTACGTCCATTTACTTTGACACCTGGTGAAGATACTGAAGATTTCATACAATTAATAACTCCTGTTCGTACTAACTAA
>orf00003
ATGATGAGTAATATGACTCTATATATAATAGCTAACCCCCATGCTGGTAATAAAAATGCCTCCACTATTGTTGGTCAAATTCAGGAGTTTTATCATACTGAAGATATTTCTGTGTTCTATACAGAACAGAAAGATGATGAAAAAAAACAAGTCATTAATATACTAAGGTCTTTTAAAGAAAGTGATCATCTAATGATTATAGGAGGAGATGGTACCTTATCAAAAGTAATGACTTATCTCCCCAACATATTCCGTGCGCTTATTATCCTGTTGGTTCGGGAAATGATTTTGCCAGAGCTTTGA

可用perl脚本count.pl来实现:

#!/usr/bin/perl
use strict;
use warnings;

#Usage:perl count.pl $ARGV[0] $ARGV[1]

my $total_len=0;
my $total_num=0;
my $total_gap=0;
my $total_gc=0;
my @len_array=();

open IN,">$ARGV[0]",or die "can't open the file $!\n"; #genome file
open OUT,">$ARGV[1]"; #result file

print OUT "ID\tGeneLength(bp)\tGapLength(bp)\tGCContent(%)\n";

$/=">";
<IN>;
while(<IN>){
next if (/^\s+$/);
chomp;
my ($id,$seq)=(split /\n/,$_,2)[0,1];
$seq=~ s/\n//g;
my $len=length ($seq);
push @len_array,$len;
$total_len+=$len;
my $gc+=($seq =~ s/G/G/g+$seq=~ s/C/C/g);
my $GC=($gc/$len)*100;
my $gap+=($seq=~ s/N/N/g);
$total_gap+=$gap;
$total_gc+=$gc;
print OUT "$id\t$len\t$gap\t";
printf OUT "%.2f\n","$GC";
$total_num++;
}

my $avg_len=($total_len/$total_num);
my $total_GC=($total_gc/$total_len)*100;
my @sort_len=sort {$a <=> $b} @len_array;

print OUT "\nTotal Stat:\n";
print OUT "Total Scaffolds Number (#):$total_num\n";
print OUT "Total Length (bp):$total_len\n";
print OUT "Gap Length (N)(bp):$total_gap\n";
printf OUT "Average Length (bp):%d\n","$avg_len";
print OUT "Minimum Length (bp):$sort_len[0]\n";
print OUT "Maximum Length (bp):$sort_len[-1]\n";
printf OUT "GC content(%%):%.2f\n","$total_GC";

close IN;
close OUT;

命令终端运行以下命令:

perl count.pl test.fa test.result.txt

便可得到以下结果test.result.txt:

ID GeneLength(bp) GapLength(bp) GCContent(%)
orf00001 1410 0 35.18
orf00002 1137 0 33.42
orf00003 303 0 33.66

Total Stat:
Total Scaffolds Number (#):3
Total Length (bp):2850
Gap Length (N)(bp):0
Average Length (bp):950
Minimum Length (bp):303
Maximum Length (bp):1410
GC content(%):34.32

install GD on MacOS

目录 MacOS

转自Qinhu Wang | 王秦虎
Various applications depend on library GD, however, you will alway fail on installing GD by CPAN.
I wrapped all the latest of zlib/libpng/jpeg/freetype/libgd/GD moudels and successfully installed GD on Mavericks (Mac OSX 10.9), hope this could help the one who have the similar environments.
Also works on Yosemite (OS X 10.10).
Note Xcode should be installed first (if you haven’t) and make sure you have a sudo privilege.
Download all the package here.
Extract the package gd.tar.gz and type sudo ./install to install.
This is an alternative approach:

curl -O http://wangqinhu.com/data/gd/gd.tar.gz
tar -zxf gd.tar.gz
cd gd
sudo ./install

Here is the script if you want to modify for your situation (also supplied in the package).

#!/bin/sh
#
# Install zlib/libpng/jpeg/freetype/libgd/GD
#
# Wang Qinhu (qinhu.wang # gmail.com)
# Nov 17, 2013
# version 0.2
#
#install zlib
echo "><><><><><><><><><"
echo "Installing zlib"
echo "><><><><><><><><><"
tar -zxf zlib-1.2.8.tar.xz
cd zlib-1.2.8
./configure
make
sudo make install
cd ..
#install libpng
echo "><><><><><><><><><"
echo "Installing libpng"
echo "><><><><><><><><><"
#install jpegb
echo "><><><><><><><><><"
echo "Installing libjpeg"
echo "><><><><><><><><><"
tar -zxf jpegsrc.v9.tar.gz
cd jpeg-9
./configure
make
sudo make install
cd ..
#install freetype
echo "><><><><><><><><><"
echo "Installing freetype"
echo "><><><><><><><><><"
tar -zxf freetype-2.5.0.1.tar.bz2
cd freetype-2.5.0.1
./configure
make
sudo make install
cd ..
#install libgd
echo "><><><><><><><><><"
echo "Installing libgd"
echo "><><><><><><><><><"
tar -zxf libgd-2.1.0.tar.xz
cd libgd-2.1.0
./configure --with-zlib=/usr/local --with-jpeg=/usr/local --with-png=/usr/local --with-freetype=/usr/local
make
sudo make install
cd ..
#install GD
echo "><><><><><><><><><"
echo "Installing GD"
echo "><><><><><><><><><"
tar -zxf GD-2.50.tar.gz
cd GD-2.50
perl Makefile.PL
make
sudo make install
cd ..
#clean
rm -rf zlib-1.2.8 libpng-1.6.6 jpeg-9 freetype-2.5.0.1 libgd-2.1.0 GD-2.50
#test
echo "GD" | xargs -I MODULE perl -e 'print eval "use MODULE;1"?"\n\n\nOK\n\n\n":"\n\n\nFail\n\n\n"'

circos install

目录 Circos

1.下载circosPerl(注意自己电脑是32bit还是64bit);
2.安装Perl到 C:\Strawberry\,将解压后的circos文件夹移到Strawberry目录下;
3.点击“开始”→“运行”→输入“CMD”,依次输入下面命令:

cd \Strawberry #移到安装目录
perl –version #查看Perl是否安装成功
cpan #进入cpan
install Font::TTF::Font Config::General Math::Bezier Math::VecStat Readonly Text::Format Statistics::Basic Set::IntSpan Regexp::Common SVG Getopt::Long List::MoreUtils Params::Validate Math::Round Carp Clone Cwd Date::Dumper Digest::MD5 File::Spec::Functions File::Temp GD IO::File List::Util Memoize POSIX Pod::Usage Storable Sys::Hostname Text::Balanced Time::HiRes #安装模块
cd circos-0.69/ #移到circos目录
perl ./bin/circos -conf ./example/etc/circos.conf #测试能否运行circos