Process sequence data

Usage

process_input(
  seq = NULL,
  annotation = NULL,
  gene_field = "gene_id",
  filter_annotation = FALSE
)

seq: A list of AAStringSet objects, each list element containing protein sequences for a given species. This list must have names (not NULL), and names of each list element must match the names of list elements in annotation.
annotation: A GRangesList, CompressedGRangesList, or list of GRanges with the annotation for the sequences in seq. This list must have names (not NULL), and names of each list element must match the names of list elements in seq.
gene_field: Character, name of the column in the GRanges objects that contains gene IDs. Default: "gene_id".
filter_annotation: Logical indicating whether annotation should be filtered to keep only genes that are also in seq. This is particularly useful if users want to remove information on non-protein coding genes from annotation, since such genes are typically not present in sets of whole-genome protein sequences. Default: FALSE.

A list of 2 elements:

seq: The processed list of AAStringSet objects from seq.
annotation: The processed GRangesList or CompressedGRangesList object from annotation.

This function processes the input sequences and annotation to:

Remove whitespace and anything after it in sequence names (i.e., names(seq[[x]]), which is equivalent to FASTA headers), if there is any.
Add a unique species identifier to sequence names. The species identifier consists of the first 3-5 strings of the element name. For instance, if the first element of the seq list is named "Athaliana", each sequence in it will have an identifier "Atha_" added to the beginning of each gene name (e.g., Atha_AT1G01010).
If sequences have an asterisk (*) representing stop codon, remove it.
Add a unique species identifier (same as above) to gene and chromosome names of each element of the annotation GRangesList/CompressedGRangesList.
Filter each element of the annotation GRangesList/CompressedGRangesList to keep only seqnames, ranges, and gene ID.

data(annotation)
data(proteomes)
seq <- proteomes
clean_data <- process_input(seq, annotation)