Collapse protein IDs into gene IDs in sequence names of AAStringSet objects
Source: R/02_data_preprocessing.R
collapse_protein_ids.Rd
Use this function when the sequence names of the AAStringSet objects contain protein IDs rather than the gene IDs that syntenet requires.
Arguments
- seq
A list of AAStringSet objects, each list element containing the protein sequences of a given species. The list must be named (names cannot be NULL), and its names must match the names of the list elements in protein2gene.
- protein2gene
A list of 2-column data frames containing protein-to-gene ID correspondences, where the first column contains protein IDs, and the second column contains gene IDs. Names of list elements must match names of seq.
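To make the expected input shapes concrete, here is a minimal sketch built from toy data. The species name `sp1`, the protein and gene IDs, the sequences, and the column names `protein_id` and `gene_id` are all made up for illustration; the description above only fixes the column order (protein IDs first, gene IDs second) and the matching list names.

```r
library(Biostrings)

# Hypothetical toy data: one species ("sp1") with three protein sequences,
# two of which are isoforms of the same gene
seq <- list(
    sp1 = AAStringSet(c(
        prot1.1 = "MKTAYIAKQR",
        prot1.2 = "MKTAYIAKQRQISFVK",
        prot2.1 = "MSLLTEVETY"
    ))
)

# Column 1 holds protein IDs, column 2 holds gene IDs
# (column names are illustrative assumptions)
protein2gene <- list(
    sp1 = data.frame(
        protein_id = c("prot1.1", "prot1.2", "prot2.1"),
        gene_id = c("gene1", "gene1", "gene2")
    )
)

# Checks mirroring the requirements described in the arguments
inputs_ok <- !is.null(names(seq)) &&
    setequal(names(seq), names(protein2gene)) &&
    all(vapply(protein2gene, ncol, integer(1)) == 2L)
```

With inputs shaped like this, `inputs_ok` should be `TRUE`, and the two lists can be passed as `seq` and `protein2gene`.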
Details
For each species, this function will replace the protein IDs in sequence names with gene IDs using the protein-to-gene correspondence table in protein2gene. After replacing protein IDs with gene IDs, if there are multiple sequences with the same gene ID (indicating different isoforms of the same gene), only the longest sequence is kept, so that the number of sequences is not greater than the number of genes.
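The logic described above can be sketched in base R for a single species. This is an illustration only, not the package's implementation: `collapse_sketch` is a hypothetical helper, named character vectors stand in for an AAStringSet, and two behaviors are assumptions of the sketch (sequences whose protein ID is absent from the table are dropped, and ties in length are resolved by keeping the sequence that comes first).

```r
# Hypothetical helper: base-R sketch of the collapse step for one species
collapse_sketch <- function(seqs, map) {
    # map: 2-column data frame (protein ID, gene ID); drop repeated rows
    map <- unique(map[, 1:2])
    # Look up the gene ID for each protein ID found in the sequence names
    gene_ids <- map[[2]][match(names(seqs), map[[1]])]
    # Assumption: sequences with no entry in the table are dropped
    keep <- !is.na(gene_ids)
    seqs <- seqs[keep]
    names(seqs) <- gene_ids[keep]
    # Sort by decreasing length so the longest isoform comes first per gene
    seqs <- seqs[order(nchar(seqs), decreasing = TRUE)]
    # Keep only the first (longest) sequence for each gene ID
    seqs[!duplicated(names(seqs))]
}

# Toy data: prot1.1 and prot1.2 are isoforms of gene1
seqs <- c(
    prot1.1 = "MKTAYIAKQR",
    prot1.2 = "MKTAYIAKQRQISFVK",
    prot2.1 = "MSLLTEVETY"
)
map <- data.frame(
    protein_id = c("prot1.1", "prot1.2", "prot2.1"),
    gene_id = c("gene1", "gene1", "gene2")
)

collapsed <- collapse_sketch(seqs, map)
```

In this toy case, `collapsed` holds one sequence per gene: the 16-residue isoform for `gene1` and the single sequence for `gene2`.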
Examples
# Load the package and the example data shipped with it
library(syntenet)
seq_path <- system.file(
"extdata", "RefSeq_parsing_example", package = "syntenet"
)
seq <- fasta2AAStringSetlist(seq_path)
annot <- gff2GRangesList(seq_path)
# Clean sequence names: keep only the text before the first space
names(seq$Aalosa) <- gsub(" .*", "", names(seq$Aalosa))
# Create a correspondence data frame from the CDS features
# (column 1: protein IDs, column 2: gene IDs)
cor_df <- as.data.frame(annot$Aalosa[annot$Aalosa$type == "CDS", ])
cor_df <- cor_df[, c("Name", "gene")]
# Create a list of correspondence data frames
protein2gene <- list(Aalosa = cor_df)
# Replace protein IDs with gene IDs, keeping the longest sequence per gene
new_seqs <- collapse_protein_ids(seq, protein2gene)