util.fasta {CHNOSZ}    R Documentation

Functions for Reading FASTA Files and Downloading from UniProt

Description

Search the header lines of a FASTA file, read protein sequences from a file, count numbers of amino acids in each sequence, and download sequences from UniProt.

Usage

  read.fasta(file, iseq = NULL, ret = "count", lines = NULL,
    ihead = NULL, start = NULL, stop = NULL, type = "protein", id = NULL)
  count.aa(seq, start = NULL, stop = NULL, type = "protein")

Arguments

file

character, path to FASTA file

iseq

numeric, which sequences to read from the file

ret

character, type of return value: ‘count’ (amino acid counts), ‘seq’ (sequences), or ‘fas’ (lines in FASTA format)

lines

list of character, supply the lines here instead of reading them from file

ihead

numeric, which lines are headers

start

numeric, position in sequence to start counting

stop

numeric, position in sequence to stop counting

type

character, sequence type (protein or DNA)

id

character, value to be used for protein in output table

seq

character, amino acid sequence of a protein (or nucleic acid sequence, depending on type)

Details

read.fasta is used to retrieve entries from a FASTA file. Use iseq to select the sequences to read (the default is all sequences).

The function returns various formats depending on the value of ret. The default ‘count’ returns a data frame of amino acid counts (the data frame can be given to add.protein in order to add the proteins to thermo()$protein), ‘seq’ returns a list of sequences, and ‘fas’ returns a list of lines extracted from the FASTA file, including the headers (this can be used, e.g., to generate a new FASTA file with only the selected sequences).

If the line numbers of the header lines were previously determined, they can be supplied in ihead. Optionally, the lines of a previously read file may be supplied in lines (in this case no file is needed, so file should be set to "").

When ret is ‘count’, the names of the proteins in the resulting data frame are parsed from the header lines of the file, unless id is provided. If id is not given, and a UniProt FASTA header is detected (regular expression "\|......\|.*_"), information there (accession, name, organism) is split into the protein, abbrv, and organism columns of the resulting data frame.
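The header test described above can be tried in base R without loading the package. The following is only a sketch: the header lines are examples, the variable names are hypothetical, and the way the fields are split is an assumption made for illustration, not necessarily how read.fasta parses them.

```r
# Two example header lines; the second is made up and has no UniProt fields
headers <- c(
  ">sp|P12345|AATM_RABIT Aspartate aminotransferase, mitochondrial",
  ">my_protein hypothetical header without UniProt fields"
)

# The regular expression quoted in Details: "|", any six characters,
# "|", then any text followed by an underscore
uniprot.pattern <- "\\|......\\|.*_"
is.uniprot <- grepl(uniprot.pattern, headers)

# Assumed parsing (illustration only): split the matching header on "|",
# then split the entry name on "_"
fields <- strsplit(headers[is.uniprot], "|", fixed = TRUE)[[1]]
accession <- fields[2]
entry <- strsplit(fields[3], " ", fixed = TRUE)[[1]][1]
entry.parts <- strsplit(entry, "_", fixed = TRUE)[[1]]
name <- entry.parts[1]
organism <- entry.parts[2]

print(is.uniprot)
print(c(accession, name, organism))
```

Because the pattern requires exactly six characters between the two "|" symbols, headers whose identifier has a different length would presumably not be treated as UniProt headers; in that case, supplying id is the documented way to control the protein names.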

count.aa counts the occurrences of each amino acid or nucleic-acid base in a sequence (seq). For amino acids, the columns in the returned data frame are in the same order as thermo()$protein. The matching of letters is case-insensitive. A warning is generated if any character in seq, excluding spaces, is not one of the single-letter amino acid or nucleobase abbreviations. start and/or stop can be provided to count a fragment of the sequence (extracted using substr). If only one of start or stop is present, the other defaults to 1 (start) or the length of the sequence (stop).
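The counting rules in this paragraph (case-insensitive matching, start and stop defaults, a warning for unrecognized characters) can be sketched in base R as follows. The function name count.sketch is hypothetical, it returns a named integer vector rather than a data frame, and the letters are in alphabetical order, which is not claimed to match the column order of thermo()$protein. It also checks only the extracted fragment for unrecognized characters, which may differ from what count.aa does.

```r
# Hypothetical stand-in for the behavior described above; not the CHNOSZ code
count.sketch <- function(seq, start = NULL, stop = NULL) {
  # One-letter codes of the 20 standard amino acids, alphabetical order
  aa <- strsplit("ACDEFGHIKLMNPQRSTVWY", "")[[1]]
  # A missing start defaults to 1; a missing stop to the sequence length
  if (is.null(start)) start <- 1
  if (is.null(stop)) stop <- nchar(seq)
  # Extract the fragment with substr and make matching case-insensitive
  chars <- strsplit(toupper(substr(seq, start, stop)), "")[[1]]
  # Warn about anything that is neither a recognized letter nor a space
  if (any(!chars %in% c(aa, " "))) {
    warning("unrecognized character(s) in sequence")
  }
  # Count the occurrences of each letter
  vapply(aa, function(letter) sum(chars == letter), integer(1))
}

whole <- count.sketch("GGSGG")
fragment <- count.sketch("GGSGG", start = 2)
mixed <- suppressWarnings(count.sketch("WhatAmIMadeOf?"))
print(whole[c("G", "S")])
print(fragment["G"])
print(mixed["A"])
```

With only start = 2 given, the fragment runs to the end of the sequence, so one fewer glycine is counted than in the whole sequence.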

Value

read.fasta returns a list of sequences or lines (for ret equal to ‘seq’ or ‘fas’, respectively), or a data frame with amino acid compositions of proteins (for ret equal to ‘count’) with columns corresponding to those in thermo()$protein. count.aa returns a data frame with the counts of each amino acid or nucleic-acid base in the sequence.
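To picture the difference between the ‘seq’ and ‘fas’ style results, the following base R sketch extracts both from a few FASTA lines held in memory. It is an illustration under stated assumptions, not the package's implementation: the FASTA content is made up, every record is assumed to have at least one sequence line, and the ‘fas’ style result is shown as a plain character vector.

```r
# Hypothetical FASTA content, as it might be read with readLines()
lines <- c(">seq1 first example", "MKV", "LAG",
           ">seq2 second example", "GGSGG")

# Header lines start with ">"; this plays the role of ihead
ihead <- grep("^>", lines)

# Each record runs from the line after its header
# to the line before the next header (or the end of the file)
first <- ihead + 1
last <- c(ihead[-1] - 1, length(lines))

# 'seq' style: one concatenated sequence string per record
sequences <- lapply(seq_along(ihead), function(i) {
  paste(lines[first[i]:last[i]], collapse = "")
})

# 'fas' style for a selected record: header plus its sequence lines
iseq <- 2
fas <- lines[ihead[iseq]:last[iseq]]

print(sequences)
print(fas)
```

Writing the ‘fas’ style lines to disk, for example with writeLines, would give a new FASTA file that contains only the selected record.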

See Also

seq2aa, like count.aa, counts amino acids in a user-input sequence, but returns a data frame in the format of thermo()$protein.

Examples


## Reading a protein FASTA file
# The path to the file
file <- system.file("extdata/protein/EF-Tu.aln", package = "CHNOSZ")
# Read the sequences, and print the first one
read.fasta(file, ret = "seq")[[1]]
# Count the amino acids in the sequences
aa <- read.fasta(file)
# Compute lengths (number of amino acids)
protein.length(aa)

## Not run: 
## Count amino acids in a sequence
count.aa("GGSGG")
# Warnings are issued for unrecognized characters
atest <- count.aa("WhatAmIMadeOf?")
# There are 3 "A" (alanine)
atest[, "A"]

## End(Not run)

[Package CHNOSZ version 2.1.0 Index]