util.blast {CHNOSZ}R Documentation

Functions to Work with BLAST Output Files

Description

Read and filter BLAST tabular output files, make taxonomic identifications of the BLAST hits using gi numbers, write trimmed-down BLAST files.

Usage

  read.blast(file, similarity = 30, evalue = 1e-5, max.hits = 1, 
    quiet = FALSE)
  id.blast(blast, gi.taxid, taxid.names, min.taxon = 0, 
    min.query = 0, min.phylum = 0, take.first = TRUE)
  write.blast(blast, outfile)
  def2gi(def)

Arguments

file character, name of BLAST tabular output file.
similarity numeric, hits above this similarity score are kept.
evalue character, hits below this E value are kept.
max.hits numeric, up to this many hits are kept for each query sequence.
quiet logical, produce fewer messages?
blast dataframe, BLAST table.
gi.taxid list, first component is sequence identifiers (gi numbers), second is taxon ids (taxids).
taxid.names dataframe, with at least columns taxid (taxon id), phylum (name of phylum), species (name of species).
min.taxon numeric, this taxon is kept if it makes up at least this fraction of total.
min.query numeric, query sequence is counted if a single phylum makes up this fraction of its hits.
min.phylum numeric, this phylum is kept if it makes up at least this fraction of total.
take.first logical, keep only first hit after all other filtering steps?
outfile character, name of output file.
def character, FASTA defline(s)

Details

read.blast reads a BLAST tabular output file, keeping only those hits with greater than or equal to the similarity and less than or equal to the evalue (expectation value) specified in the arguments. Furthermore, for each query sequence, only the top number of hits specified by max.hits are kept. Note that BLAST (Altschul et al., 1997) tabular output files can be generated using the -m 8 switch to the ‘blastall’ command.

id.blast takes a BLAST table (i.e., the output of read.blast) and finds the taxonomic ID, phylum and species name for each hit (subject sequence). The BLAST results are tied to taxids using gi.taxid, which is a list consisting of gi and taxid numeric vectors. Any subject sequence identifiers appearing in the BLAST file that do not match gi numbers in the gi.taxid list are dropped. The taxid.names dataframe lists the phylum and species names for each taxid.

id.blast furthermore performs three possible filtering steps, which are all disabled by default. If one or more of the arguments is set to a non-zero value, its operation is performed, in this order. Any taxon that does not initially make up at least the fraction of total hits given by min.taxon is removed. Any query sequence that does not have a single phylum making up at least the fraction of hits (for each query sequence) given by min.query is removed. Finally, any phylum that does not make up at least the fraction of total hits given by min.phylum is removed.

By default, for take.first equal to TRUE, id.blast performs a final filtering step (but min.query must be disabled). Only the first hit for each query sequence is kept.

write.blast takes a BLAST table (the output of read.blast) and writes to outfile a stripped-down BLAST file with empty values in the columns except for columns 1 (query sequence ID), 2 (hit sequence ID), 3 (similarity), 11 (E value). In the process, def2gi is used to extract the GI numbers for the hit sequences that are then kept in the second column. This function is used to reduce the size of the example BLAST files that are packaged with CHNOSZ (see the ‘bison’ section in extdata).

def2gi extracts the GI number from a FASTA defline.

Value

read.blast returns a dataframe with as many columns (12) as the BLAST file. id.blast returns a dataframe with columns query, subject (i.e., sequence id or gi number), similarity, evalue, taxid, phylum and species. write.blast invisible-y returns the results (that are also written to outfile).

References

Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J. H., Zhang, Z., Miller, W. and Lipman, D. J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389–3402. http://dx.doi.org/doi:10.1093/nar/25.17.3389

See Also

The extdata/refseq directory contains the taxid_names.csv.xz file for microbial taxa, which can be used for the taxid.names in id.blast.

Examples

  ## using def2gi
  def <- "gi|218295810|ref|ZP_03496590.1|"
  stopifnot(all.equal(def2gi(def), "218295810"))

  ## process some of the BLAST output for proteins
  ## from Bison Pool metagenome (JGI, 2007)
  # read the file that connects taxids with the sequence identifier
  tfile <- system.file("extdata/bison/gi.taxid.txt.xz", package="CHNOSZ")
  gi.taxid <- scan(tfile, what=as.list(character(2)), flush=TRUE)
  # read the file that connects names with the taxids
  nfile <- system.file("extdata/refseq/taxid_names.csv.xz", package="CHNOSZ")
  taxid.names <- read.csv(nfile)
  # the BLAST files
  sites <- c("N","S","R","Q","P")
  bfile <- paste("extdata/bison/bison", sites, "_vs_refseq47.blastp.xz", sep="")
  for(i in 1:5) {
    file <- system.file(bfile[i], package="CHNOSZ")
    # read the blast file, with default filtering settings
    bl <- read.blast(file)
    # process the blast file -- get taxon names
    ib <- id.blast(bl, gi.taxid, taxid.names, min.taxon=2^-7)
    # count each of the phyla
    bd <- as.matrix(sapply(unique(ib$phylum), function(x) (sum(x==ib$phylum))))
    colnames(bd) <- sites[i]
    # make a matrix -- each column for a different file
    if(i==1) bardata <- bd else {
      bardata <- merge(bardata, bd, all=TRUE, by="row.names")
      rownames(bardata) <- bardata$Row.names
      bardata <- bardata[,-1]
    }
  }
  # normalize the counts
  bardata[is.na(bardata)] <- 0
  bardata <- t(t(bardata)/colSums(bardata))
  # make a bar chart
  bp <- barplot(as.matrix(bardata), col=rainbow(nrow(bardata)),
    xlab="location", ylab="fractional abundance")
  # add labels to the bars
  names <- substr(row.names(bardata), 1, 3)
  for(i in 1:5) {
    bd <- bardata[,i]
    ib <- bd!=0
    y <- (cumsum(bd) - bd/2)[ib]
    text(bp[i], y, names[ib])
  }
  title(main=paste("Most Abundant Phyla in a Portion",
    "of the Bison Pool Metagenome", sep="\n"))

[Package CHNOSZ version 0.9-7 Index]