3  Assessing orthogroup inference in public databases

Here, we will use the protein domain-based approach in cogeqc to assess gene families from several public databases.


set.seed(123) # for reproducibility
source(here::here("code", "utils.R")) # helper functions used below

3.1 Calculating orthogroup scores

To make comparisons possible, we will use Arabidopsis thaliana domain annotation as a proxy, as this species is present in all of the databases assessed here. For that, we will use the function calculate_H() from cogeqc.
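
As a rough intuition for what a domain-based homogeneity score captures, the sketch below scores each orthogroup by the mean pairwise Sørensen-Dice similarity between the domain sets of its genes. This is a simplified illustration in base R and should not be read as the computation performed by calculate_H(), whose exact formula and scaling are not described in this section. The gene and domain identifiers, and the helper names dice() and og_homogeneity(), are made up for the example.

```r
# Toy input in the Gene / Orthogroup / Annotation layout used in this section.
# All identifiers are invented for illustration.
toy <- data.frame(
    Gene = c("g1", "g1", "g2", "g2", "g3", "g4"),
    Orthogroup = c("OG1", "OG1", "OG1", "OG1", "OG2", "OG2"),
    Annotation = c("D1", "D2", "D1", "D2", "D1", "D3")
)

# Sorensen-Dice similarity between two sets of domains (hypothetical helper)
dice <- function(a, b) {
    2 * length(intersect(a, b)) / (length(a) + length(b))
}

# Mean pairwise similarity among the genes of one orthogroup (hypothetical helper)
og_homogeneity <- function(df) {
    domains <- lapply(split(df$Annotation, as.character(df$Gene)), unique)
    if (length(domains) < 2) return(NA_real_)
    pairs <- combn(names(domains), 2, simplify = FALSE)
    sims <- vapply(
        pairs,
        function(p) dice(domains[[p[1]]], domains[[p[2]]]),
        numeric(1)
    )
    mean(sims)
}

# One score per orthogroup
toy_scores <- vapply(split(toy, toy$Orthogroup), og_homogeneity, numeric(1))
print(toy_scores)
```

In this toy example, OG1 scores 1 because both of its genes carry the same two domains, while OG2 scores 0 because its genes share none.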

Orthogroup assignments from OrthoDB, eggNOG, InParanoid, PhylomeDB, and HOGENOM will be obtained from UniProt.

3.1.1 PLAZA Dicots 5.0

Below, we will obtain orthogroups and A. thaliana’s domain annotation from PLAZA 5.0, and then we will calculate homogeneity scores for each orthogroup.

# Obtain gene families from PLAZA
fams_plaza <- readr::read_tsv(
    ), show_col_types = FALSE, skip = 2
) %>%
    filter(species == "ath") %>%
    as.data.frame()
names(fams_plaza) <- c("Orthogroup", "Species", "Gene")
head(fams_plaza)
    Orthogroup Species      Gene
1 HOM05D000001     ath AT1G02310
2 HOM05D000001     ath AT1G03510
3 HOM05D000001     ath AT1G03540
4 HOM05D000001     ath AT1G04020
5 HOM05D000001     ath AT1G04840
6 HOM05D000001     ath AT1G05750
# Obtain domain annotation for A. thaliana
ath_interpro <- readr::read_tsv(
    ), show_col_types = FALSE, skip = 8
) %>%
names(ath_interpro) <- c("Gene", "Annotation")
head(ath_interpro)
# A tibble: 6 × 2
  Gene      Annotation
  <chr>     <chr>     
1 AT1G01010 IPR036093 
2 AT1G01010 IPR003441 
3 AT1G01010 IPR036093 
4 AT1G01020 IPR007290 
5 AT1G01020 IPR007290 
6 AT1G01030 IPR003340 
# Combining everything and calculating homogeneity scores
fam_df_plaza <- merge(fams_plaza, ath_interpro) # joins on the shared "Gene" column
head(fam_df_plaza)
       Gene   Orthogroup Species Annotation
1 AT1G01010 HOM05D000010     ath  IPR036093
2 AT1G01010 HOM05D000010     ath  IPR003441
3 AT1G01010 HOM05D000010     ath  IPR036093
4 AT1G01020 HOM05D006082     ath  IPR007290
5 AT1G01020 HOM05D006082     ath  IPR007290
6 AT1G01030 HOM05D000466     ath  IPR015300
# Calculate homogeneity scores and summarize them with their mean and median
H_summary <- function(ortho_df = NULL) {
    H <- calculate_H(ortho_df)
    mean_H <- round(mean(H$Score), 2)
    median_H <- round(median(H$Score), 2)
    result_list <- list(H = H, mean_score = mean_H, median_score = median_H)
    return(result_list)
}

H_plaza <- H_summary(fam_df_plaza)
head(H_plaza$H)
    Orthogroup     Score
1 HOM05D000001  283.3132
2 HOM05D000002  129.9598
3 HOM05D000003  889.1268
4 HOM05D000004    0.0000
5 HOM05D000005 1135.8799
6 HOM05D000006 2820.8337
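
The merge() step above relies on base R's default behaviour of joining on the columns the two data frames share (here, Gene) and keeping only rows with a match in both. The sketch below shows this on made-up data, so every identifier in it is hypothetical.

```r
# Toy orthogroup assignments and domain annotations (invented identifiers)
toy_fams <- data.frame(
    Orthogroup = c("OG1", "OG1", "OG2"),
    Species = "ath",
    Gene = c("g1", "g2", "g3")
)
toy_domains <- data.frame(
    Gene = c("g1", "g1", "g2", "g4"),
    Annotation = c("D1", "D2", "D1", "D9")
)

# Inner join on the shared column "Gene":
# g3 (no domain annotation) and g4 (no orthogroup) are dropped,
# and g1 appears once per annotated domain.
toy_merged <- merge(toy_fams, toy_domains)
print(toy_merged)
```

One consequence worth keeping in mind is that genes without any domain annotation silently disappear from the merged table, so they do not contribute to the scores.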

3.1.2 OrthoDB, eggNOG, and HOGENOM

Orthogroup assignments from these databases will be obtained from UniProt (Consortium 2021).

# Get list of proteins - from primary transcripts only
ath_proteome <- Biostrings::readAAStringSet(
ath_proteins <- names(ath_proteome)
ath_proteins <- sapply(strsplit(ath_proteins, split = "\\|"), `[`, 2)

# Extract phylogenomic information for all genes
source(here::here("code", "utils.R"))
fams_uniprot <- extract_ogs_uniprot(ath_proteins)

fams_orthodb <- fams_uniprot[, c("Gene", "OrthoDB")] %>% drop_na()
fams_eggnog <- fams_uniprot[, c("Gene", "eggNOG")] %>% drop_na()
fams_hogenom <- fams_uniprot[, c("Gene", "HOGENOM")] %>% drop_na()

#----Calculate homogeneity scores for each database-----------------------------
# OrthoDB
fams_df_orthodb <- merge(fams_orthodb, ath_interpro)
names(fams_df_orthodb)[2] <- "Orthogroup"
H_orthodb <- H_summary(fams_df_orthodb)

# eggNOG
fams_df_eggnog <- merge(fams_eggnog, ath_interpro)
names(fams_df_eggnog)[2] <- "Orthogroup"
H_eggnog <- H_summary(fams_df_eggnog)

# HOGENOM
fams_df_hogenom <- merge(fams_hogenom, ath_interpro)
names(fams_df_hogenom)[2] <- "Orthogroup"
H_hogenom <- H_summary(fams_df_hogenom)
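
The two recurring steps in this subsection, extracting accessions from FASTA headers and turning a wide table of cross-references into one Gene/Orthogroup table per database, can be sketched on made-up data. The headers, accessions, and group IDs below are invented, and base R's is.na() stands in for drop_na() so that the example runs without extra packages.

```r
# Made-up UniProt-style headers: "db|accession|entry_name"
headers <- c("sp|P00001|ABC_ARATH", "tr|Q00002|Q00002_ARATH")

# Take the second "|"-separated field of each header
ids <- vapply(strsplit(headers, split = "\\|"), `[`, character(1), 2)
print(ids)

# Made-up cross-reference table with some missing assignments
toy_xref <- data.frame(
    Gene = c("g1", "g2", "g3"),
    OrthoDB = c("OD1", NA, "OD2"),
    eggNOG = c("E1", "E1", NA)
)

# One Gene/Orthogroup table per database, dropping genes without an assignment
dbs <- c("OrthoDB", "eggNOG")
per_db <- lapply(setNames(dbs, dbs), function(db) {
    df <- toy_xref[!is.na(toy_xref[[db]]), c("Gene", db)]
    names(df)[2] <- "Orthogroup"
    df
})
print(per_db)
```

Looping over database names like this is one way to avoid repeating the merge-and-rename block for each source; the explicit version used above is equally valid and arguably easier to read.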

3.2 Comparing homogeneity scores

Finally, let’s compare homogeneity scores and visualize their distributions. First, let’s combine all data frames of homogeneity scores into a single data frame.

H_combined <- bind_rows(
    H_plaza$H %>% mutate(Source = "PLAZA"),
    H_orthodb$H %>% mutate(Source = "OrthoDB"),
    H_eggnog$H %>% mutate(Source = "eggNOG"),
    H_hogenom$H %>% mutate(Source = "HOGENOM")
)

# Save the combined scores
save(
    H_combined,
    file = here::here("products", "result_files", "H_combined.rda"),
    compress = "xz"
)

Now, let’s compare the distributions of homogeneity scores for each database to see if there are any differences. For that, we will calculate P-values from a Wilcoxon test, together with Wilcoxon effect sizes (r). The Wilcoxon effect size is calculated as the Z statistic divided by the square root of the sample size.
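
To make the effect size definition concrete, the sketch below computes r = Z / sqrt(N) by hand in base R for two simulated groups. It assumes that N is the total number of observations across both groups and that Z comes from the normal approximation of the test without continuity correction; the helper functions used later in this chapter may differ in such details. The simulated scores are made up.

```r
set.seed(1)

# Simulated scores for two hypothetical sources (values are made up)
x <- rnorm(200, mean = 0.60, sd = 0.10)
y <- rnorm(200, mean = 0.57, sd = 0.10)

# Wilcoxon rank-sum test using the normal approximation
wt <- wilcox.test(x, y, exact = FALSE, correct = FALSE)

# Recover |Z| from the two-sided P-value
z <- qnorm(wt$p.value / 2, lower.tail = FALSE)

# Effect size: |Z| divided by the square root of the total sample size
n_total <- length(x) + length(y)
r <- z / sqrt(n_total)

print(c(p_value = wt$p.value, Z = z, r = r))
```

Because r divides Z by sqrt(N), it stays on a comparable scale across sample sizes, which is what makes it a useful companion to the P-value.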

# Scale scores to maximum, so that they range from 0 to 1
H_combined$Score <- H_combined$Score / max(H_combined$Score)
head(H_combined)
    Orthogroup      Score Source
1 HOM05D000001 0.10043599  PLAZA
2 HOM05D000002 0.04607143  PLAZA
3 HOM05D000003 0.31520000  PLAZA
4 HOM05D000004 0.00000000  PLAZA
5 HOM05D000005 0.40267523  PLAZA
6 HOM05D000006 1.00000000  PLAZA
# Quick exploration of means and medians
H_combined %>%
    group_by(Source) %>%
    summarise(mean = mean(Score), median = median(Score))
# A tibble: 4 × 3
  Source   mean median
  <chr>   <dbl>  <dbl>
1 HOGENOM 0.603  0.609
2 OrthoDB 0.578  0.567
3 PLAZA   0.610  0.6  
4 eggNOG  0.565  0.546
# Compare homogeneity scores - all vs all
db_wilcox <- compare(H_combined, "Score ~ Source")

db_wilcox |>
    filter_comparison() |>
    knitr::kable(
        caption = "Mann-Whitney U test for differences in orthogroup scores with Wilcoxon effect sizes.",
        digits = 10
    )
Mann-Whitney U test for differences in orthogroup scores with Wilcoxon effect sizes.

group1   group2   n1    n2    padj     effsize     magnitude
eggNOG   HOGENOM  3092  3257  0.0e+00  0.11102956  small
eggNOG   OrthoDB  3092  3201  8.5e-09  0.07197679  small
eggNOG   PLAZA    3092  3503  0.0e+00  0.09434683  small
HOGENOM  OrthoDB  3257  3201  0.0e+00  0.09071787  small
HOGENOM  PLAZA    3257  3503  3.0e-03  0.03402611  small
OrthoDB  PLAZA    3201  3503  7.0e-10  0.07526911  small

We can see that there are significant differences in scores among databases. In summary:

  1. eggNOG orthogroups have lower scores than every other source.

  2. HOGENOM orthogroups have higher scores than OrthoDB, but lower than PLAZA.

  3. PLAZA orthogroup scores are higher than every other database.

However, the effect sizes are very small, suggesting that the significant differences could be a consequence of the large sample sizes, since P-values shrink as sample size grows even when the underlying difference is negligible.
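
The dependence of P-values on sample size can be illustrated with a small simulation. The sketch below applies the same small shift between two groups at two sample sizes and reports the P-value and the effect size r = Z / sqrt(N) for each. The function name wilcox_p_and_r(), the shift, and the simulated scores are all made up for illustration, and N is assumed to be the total number of observations.

```r
# Hypothetical helper: simulate two groups that differ by a small shift,
# then return the Wilcoxon P-value and effect size r
wilcox_p_and_r <- function(n, shift = 0.02, sd = 0.10) {
    x <- rnorm(n, mean = 0.60, sd = sd)
    y <- rnorm(n, mean = 0.60 - shift, sd = sd)
    wt <- wilcox.test(x, y, exact = FALSE, correct = FALSE)
    z <- qnorm(wt$p.value / 2, lower.tail = FALSE)
    c(n_per_group = n, p_value = wt$p.value, r = z / sqrt(2 * n))
}

set.seed(42)

# Same underlying shift, small vs. large sample
sim <- rbind(wilcox_p_and_r(50), wilcox_p_and_r(5000))
print(sim)
```

In simulations like this one, the P-value for the larger sample is typically many orders of magnitude smaller even though the underlying shift is identical, while r remains in the range usually labelled small.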

Now, let’s visualize the distributions with significant differences highlighted. Here, we will only display comparison bars for comparisons with P < 0.05 and effect sizes > 0.1.

# Comparisons to be made
comps <- list(
    c("HOGENOM", "eggNOG")
)

# Change order of levels according to comparison results
H_combined$Source <- factor(
    H_combined$Source, levels = rev(c(
        "PLAZA", "HOGENOM", "OrthoDB", "eggNOG"
    ))
)

# Visualize distributions with significant differences highlighted
distros <- ggviolin(
    H_combined, y = "Score", x = "Source", 
    orientation = "horiz", trim = TRUE, add = c("boxplot", "mean"), 
    fill = "Source", add.params = list(fill = "white"), palette = "jama"
) +
    stat_compare_means(
        comparisons = comps,
        label = "p.signif",
        method = "wilcox.test"
    ) +
    theme(legend.position = "none") +
    labs(y = "Scaled homogeneity scores", x = "Source of orthogroups",
         title = "Distribution of mean homogeneity scores for orthogroups",
         subtitle = "Scores were calculated based on *A. thaliana* genes") +
    theme(plot.subtitle = ggtext::element_markdown())


Distribution of mean orthogroup scores.
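
The rev() call in the level ordering is easy to miss. Assuming that a horizontally oriented plot draws the first factor level at the bottom (as happens when axes are flipped in ggplot2), reversing the ranked vector places the top-ranked source at the top of the plot. A minimal base R sketch, with made-up values for the src vector:

```r
# Made-up vector of sources, in arbitrary order
src <- c("eggNOG", "PLAZA", "OrthoDB", "HOGENOM", "PLAZA")

# Ranking from highest to lowest scores, as used in this section
ranked <- c("PLAZA", "HOGENOM", "OrthoDB", "eggNOG")

# Reverse the ranking so that the best-ranked source is the LAST level
src_factor <- factor(src, levels = rev(ranked))

print(levels(src_factor))
print(as.integer(src_factor)) # position of each element within the levels
```

Here PLAZA ends up as the last level, which is the one drawn at the top under the stated assumption.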

To conclude, despite some statistically significant differences, all databases perform similarly well in their orthogroup definitions. The observed differences could be due to large sample sizes, as indicated by the very small effect sizes, and to the different species composition of each database.

Session info

This document was created under the following conditions:

