Process sequence data
Usage
process_input(
seq = NULL,
annotation = NULL,
gene_field = "gene_id",
filter_annotation = FALSE
)
Arguments
- seq
A list of AAStringSet objects, each list element containing protein sequences for a given species. This list must have names (not NULL), and names of each list element must match the names of list elements in annotation.
- annotation
A GRangesList, CompressedGRangesList, or list of GRanges with the annotation for the sequences in seq. This list must have names (not NULL), and names of each list element must match the names of list elements in seq.
- gene_field
Character, name of the column in the GRanges objects that contains gene IDs. Default: "gene_id".
- filter_annotation
Logical indicating whether annotation should be filtered to keep only genes that are also in seq. This is particularly useful if users want to remove information on non-protein coding genes from annotation, since such genes are typically not present in sets of whole-genome protein sequences. Default: FALSE.
Value
A list of 2 elements:
- seq
The processed list of AAStringSet objects from seq.
- annotation
The processed GRangesList or CompressedGRangesList object from annotation.
Details
This function processes the input sequences and annotation to:
Remove whitespace and anything after it in sequence names (i.e.,
names(seq[[x]])
, which is equivalent to FASTA headers), if there is any.Add a unique species identifier to sequence names. The species identifier consists of the first 3-5 strings of the element name. For instance, if the first element of the seq list is named "Athaliana", each sequence in it will have an identifier "Atha_" added to the beginning of each gene name (e.g., Atha_AT1G01010).
If sequences have an asterisk (*) representing stop codon, remove it.
Add a unique species identifier (same as above) to gene and chromosome names of each element of the annotation GRangesList/CompressedGRangesList.
Filter each element of the annotation GRangesList/CompressedGRangesList to keep only seqnames, ranges, and gene ID.