seqaln.Rd
Create multiple alignments of amino acid or nucleotide sequences according to the method of Edgar.
seqaln(aln, id=NULL, profile=NULL, exefile="muscle", outfile="aln.fa", protein=TRUE, seqgroup=FALSE, refine=FALSE, extra.args="", verbose=FALSE, web.args = list(), ...)
aln | a sequence character matrix, as obtained from
|
---|---|
id | a vector of sequence names to serve as sequence identifers. |
profile | a profile alignment of class ‘fasta’
(e.g. obtained from |
exefile | file path to the ‘MUSCLE’ program on your system (i.e.
how is ‘MUSCLE’ invoked). Alternatively, ‘CLUSTALO’ can
be used. Also supported is using the ‘msa’ package from Bioconductor
(need to install packages using |
outfile | name of ‘FASTA’ output file to which alignment should be written. |
protein | logical, if TRUE the input sequences are assumed to be protein not DNA or RNA. |
seqgroup | logical, if TRUE similar sequences are grouped together in the output. |
refine | logical, if TRUE the input sequences are assumed to already be aligned, and only tree dependent refinement is performed. |
extra.args | a single character string containing extra command line arguments for the alignment program. |
verbose | logical, if TRUE ‘MUSCLE’ warning and error messages are printed. |
web.args | a ‘list’ object containing arguments to perform online sequence alignment using EMBL-EBI Web Services. See below for details. |
... | additional arguments passed to the function
|
Sequence alignment attempts to arrange the sequences of protein, DNA or RNA, to highlight regions of shared similarity that may reflect functional, structural, and/or evolutionary relationships between the sequences.
Aligned sequences are represented as rows within a matrix. Gaps (‘-’) are inserted between the aminoacids or nucleotides so that equivalent characters are positioned in the same column.
This function calls the ‘MUSCLE’ program to perform a multiple sequence alignment, which must be installed on your system and in the search path for executables. If local ‘MUSCLE’ can not be found, alignment can still be performed via online web services (see below) with limited features.
If you have a large number of input sequences (a few thousand), or they are
very long, the default settings may be too slow for practical
use. A good compromise between speed and accuracy is to run just the
first two iterations of the ‘MUSCLE’ algorithm by setting the
extra.args
argument to “-maxiters 2”.
You can set ‘MUSCLE’ to improve an existing alignment by setting
refine
to TRUE.
To inspect the sequence clustering used by ‘MUSCLE’ to produce
alignments, include “-tree2 tree.out” in the extra.args
argument. You can then load the “tree.out” file with the
‘read.tree’ function from the ‘ape’ package.
‘CLUSTALO’ can be used as an alternative to ‘MUSCLE’ by
specifiying exefile='clustalo'
. This might be useful e.g. when
adding several sequences to a profile alignment.
If local ‘MUSCLE’ or ‘CLUSTALO’ program is unavailable, the alignment
can be performed via the ‘msa’ package from the Bioconductor repository.
To do so, set exefile="msa"
. Note that both ‘msa’ and
‘Biostrings’ packages need to be installed properly using BiocManager::install()
.
If the access to any method metioned above fails,
the function will attempt to perform alignment via the EMBL-EBI Web Services
(See http://www.ebi.ac.uk). In this case, the argument web.args
cannot be empty and must contain at least user's E-Mail address.
Note that as stated by EBI, a fake email address may result
in your jobs being killed and your IP, organisation or entire domain being
black-listed (See http://www.ebi.ac.uk/Tools/webservices/help/faq).
Possible parameters to be passed via web.args
include:
a string containing a valid E-Mail address. Required.
a string for the title of the job to be submitted to the remote server. Optional.
integer specifying the number of seconds to wait for the response of the server before a time out occurs. Default: 90.
An example of usage is web.args=list(email='user_id@email.provider')
.
Returns a list of class "fasta"
with the following components:
an alignment character matrix with a row per sequence and a column per equivalent aminoacid/nucleotide.
sequence names as identifers.
the matched call.
Grant, B.J. et al. (2006) Bioinformatics 22, 2695--2696.
‘MUSCLE’ is the work of Edgar: Edgar (2004) Nuc. Acid. Res. 32, 1792--1797.
Full details of the ‘MUSCLE’ algorithm, along with download and
installation instructions can be obtained from:
http://www.drive5.com/muscle.
Barry Grant
A system call is made to the ‘MUSCLE’ program, which must be installed on your system and in the search path for executables. See http://thegrantlab.org/bio3d/tutorials/installing-bio3d for instructions of how to install this program.
read.fasta
, read.fasta.pdb
,
get.seq
, seqbind
,
pdbaln
, plot.fasta
,
blast.pdb
if (FALSE) { ##-- Basic sequence alignemnt seqs <- get.seq(c("4q21_A", "1ftn_A")) aln <- seqaln(seqs) ##-- add a sequence to the (profile) alignment seq <- get.seq("1tnd_A") aln <- seqaln(seq, profile=aln) ##-- Read a folder/directory of PDB files #pdb.path <- "my_dir_of_pdbs" #files <- list.files(path=pdb.path , # pattern=".pdb", # full.names=TRUE) ##-- Use online files files <- get.pdb(c("4q21","1ftn"), URLonly=TRUE) ##-- Extract and store sequences raw <- NULL for(i in 1:length(files)) { pdb <- read.pdb(files[i]) raw <- seqbind(raw, pdbseq(pdb) ) } ##-- Align these sequences aln <- seqaln(raw, id=files, outfile="seqaln.fa") ##-- Read Aligned PDBs storing coordinate data pdbs <- read.fasta.pdb(aln) ## Sequence identity seqidentity(aln) ## Note that all the above can be done with the pdbaln() function: #pdbs <- pdbaln(files) ##-- For identical sequences with masking use a custom matrix aa <- seqbind(c("X","C","X","X","A","G","K"), c("C","-","A","X","G","X","X","K")) aln <- seqaln(aln=aln, id=c("a","b"), outfile="temp.fas", protein=TRUE, extra.args= paste("-matrix", system.file("matrices/custom.mat", package="bio3d"), "-gapopen -3.0 ", "-gapextend -0.5", "-center 0.0") ) }