Where can I publish a paper describing my Bioconductor package?

bioinformatics
reproducible research
bioconductor
scientific writing
rstats
Check out where Bioc developers have published their papers
Author

Fabrício Almeida-Silva

Published

June 12, 2023

Motivation

When I developed BioNERO, my first R/Bioconductor package, I didn’t know which journals I could submit the paper describing it to. Since then, I’ve seen many other R developers face the same issue. To help solve this problem, I will show you how to do some web scraping to find out the main journals where Bioc developers publish papers describing their packages.

Extracting citation information from Bioconductor’s browsable code base

Bioconductor offers a browsable code base that lets users explore git repositories and search code in all Bioconductor packages. If we go to the Code Search page and search journal f:CITATION, we will get a list of all CITATION files (where developers include citation information for their packages) that include the string “journal”.
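To make it concrete what the search is matching, here is a minimal sketch of the kind of entry a CITATION file may contain. The package name, authors, and title are made up for illustration; only the journal field matters here, since that is the line the search string "journal" hits.

```r
# Hypothetical CITATION entry: the package, authors, and title are invented
# for illustration only. bibentry() and person() come from the utils package,
# which is attached by default in R.
cit <- bibentry(
    bibtype = "Article",
    title   = "examplePkg: a hypothetical Bioconductor package",
    author  = c(person("Jane", "Doe"), person("John", "Smith")),
    journal = "Bioinformatics",   # the line a search for "journal" would match
    year    = 2023
)

# Extract the journal field from the entry
as.character(cit$journal)
```

Real CITATION files vary a lot in layout (some use citEntry(), some put several fields on one line), which is why the parsing further down has to be fairly defensive.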

Knowing that, we can do some web scraping using the rvest package to extract such information for all packages and parse it into a nicely-formatted data frame.
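Before scraping the real page, it may help to see what rvest::html_table() returns on a toy page. The HTML below is an assumption about the general shape of a search result (one table per file, with the package name and file path in the header), inferred from how the parsing code reads the first column name; the actual markup of the Code Search page may differ.

```r
library(rvest)

# Toy HTML mimicking one search hit; the structure and the package name
# 'myPkg' are hypothetical.
page <- minimal_html('
  <table>
    <tr><th>myPkg: inst/CITATION</th></tr>
    <tr><td>journal = "Bioinformatics",</td></tr>
  </table>
')

# html_table() on a whole document returns a list with one tibble per <table>
tabs <- html_table(page)

# The header of the first column holds "<package>: <path>", so dropping
# everything from the colon onwards leaves the package name
header <- names(tabs[[1]])[1]
pkg <- gsub(":.*", "", header)
pkg
```

Each element of the list is a small data frame whose rows are the matching lines of one CITATION file, which is the structure the parsing step relies on.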

# Load required packages
library(tidyverse)
library(rvest)

# Get URL of the search "journal f:CITATION"
url <- "https://code.bioconductor.org/search/search?q=journal%20f%3aCITATION"
n <- 2000 # number of files to show

# Get list of tables containing journal names
journal_list <- rvest::read_html(paste0(url, "&num=", n)) |>
    rvest::html_table()

# Parse list of data frames into a single, tidy data frame
journal_df <- Reduce(rbind, lapply(seq_along(journal_list), function(x) {
    
    df <- journal_list[[x]]
    
    # Package name
    pkg <- gsub(":.*", "", names(df)[1])
    
    names(df) <- "entries"
    df <- as.data.frame(df) |> 
        # 1) Keep only rows containing 'journal=' or 'journal ='
        filter(str_detect(entries, "journal\\s*=")) |>
        # 2) Get journal name (remove quotation marks, whitespace, commas, etc)
        mutate(
            journal = str_replace_all(entries, ".*=", ""),
            journal = str_replace_all(journal, '\\\"', ''),
            journal = str_replace_all(journal, "'", ''),
            journal = str_replace_all(journal, "\\.", ""),
            journal = str_squish(journal),
            journal = str_to_upper(journal),
            journal = str_replace_all(journal, ",$", ""),
            journal = str_replace_all(journal, "\\)", ""),
            journal = str_replace_all(journal, "\\(", ""),
            journal = str_replace_all(journal, "\\{", ""),
            journal = str_replace_all(journal, "\\}", "")
        ) |>
        select(journal)
    
    # Add a column named `package` containing package name
    if(nrow(df) > 0) {
        df <- df |>
            mutate(package = pkg)
    }
    
    return(df)
}))

# Taking a look at the first rows
head(journal_df)
             journal           package
1     BIOINFORMATICS        cytomapper
2 SCIENTIFIC REPORTS      IsoCorrectoR
3     BIOINFORMATICS             Rtpca
4     BIOINFORMATICS transcriptogramer
5     BIOINFORMATICS               ACE
6     BIOINFORMATICS          limmaGUI
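To see what the cleaning steps do to a single line, here is a compact sketch applied to a few made-up raw lines. The helper clean_journal() is hypothetical (it is not part of the code above) and collapses the separate replacements into one character class, so it is a simplified variant rather than an exact copy of the pipeline.

```r
library(stringr)

# Made-up raw lines, written the way they might appear in CITATION files
raw <- c(
    '    journal = "Nucleic Acids Res.",',
    "  journal='Nat Methods',",
    "    journal = {BMC Bioinformatics},"
)

# Hypothetical helper condensing the cleaning steps
clean_journal <- function(x) {
    x |>
        str_replace_all(".*=", "") |>                     # drop everything up to the last '='
        str_replace_all("[\"'\\.\\(\\)\\{\\}]", "") |>    # drop quotes, periods, brackets
        str_squish() |>                                   # trim and collapse whitespace
        str_to_upper() |>                                 # make names case-insensitive
        str_replace_all(",$", "")                         # drop the trailing comma
}

cleaned <- clean_journal(raw)
cleaned
```

The three lines come out as NUCLEIC ACIDS RES, NAT METHODS, and BMC BIOINFORMATICS: the formatting noise is gone, but abbreviations are still not harmonized.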

Now, because CITATION files are created manually by developers, a big (and expected) problem is the lack of standardization: different developers refer to the same journal by different names (e.g., Nature Methods and Nat Methods, or Nucleic Acids Research and NAR). You can see that for yourself by executing sort(unique(journal_df$journal)). While I can never expect to fix this problem completely (especially if you are reading this post in the future and new packages have been added), below is my attempt to fix most of the inconsistencies. I will probably miss some strange exceptions, but I guess I can live with it, right?

# 'Journals' to remove (these are not actually journals)
to_remove <- c(
    "", "07", "1", "10", as.character(2010:2023), "2022-2032",
    "IN REVIEW", "IN PREPARATION", "JOURNAL", "MANUSCRIPT IN PREPARATION",
    "TBA", "TBD", "UNDER REVIEW", "UNIVERSITY OF REGENSBURG",
    "BIOCONDUCTOR", "SUBMITTED", "MEDRXIV", "BIORXIV", "PREPRINT", "ARXIV"
)

# Standardize names
journal_df_clean <- journal_df |>
    filter(!journal %in% to_remove) |>
    mutate(
        journal = str_replace_all(journal, c(
            "ALBANY NY.*" = "",
            "ALGORITHMS MOL BIO" = "ALGORITHMS FOR MOLECULAR BIOLOGY",
            "ANAL CHEM" = "ANALYTICAL CHEMISTRY",
            "ANN APPL STAT" = "ANNALS OF APPLIED STATISTICS",
            "PREPRINT.*" = "",
            "BIONFORMATICS JOURNAL" = "BIOINFORMATICS",
            "OXFORD, ENGLAND" = "",
            "ACCEPTED" = "",
            "BMC SYST BIOL" = "BMC SYSTEMS BIOLOGY",
            "COMPUT METHODS PROGRAMS BIOMED" = "COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE",
            "CYTOMETRY A" = "CYTOMETRY PART A",
            "EPIGENETICS CHROMATIN" = "EPIGENETICS & CHROMATIN",
            "F1000.*" = "F1000RESEARCH",
            "FRONT BIOL" = "FRONTIERS IN BIOLOGY",
            "GENOME BIOL$" = "GENOME BIOLOGY",
            "GENOME RES$" = "GENOME RESEARCH",
            ", CODE SNIPPETS" = "",
            ", SERIES B" = "",
            "J MACH LEARN RES" = "JOURNAL OF MACHINE LEARNING RESEARCH",
            "METHODS MOL BIO" = "METHODS IN MOLECULAR BIOLOGY",
            "MOL SYST BIOL" = "MOLECULAR SYSTEMS BIOLOGY",
            "NAT BIOTECH.*" = "NATURE BIOTECHNOLOGY",
            "NAT COMM.*" = "NATURE COMMUNICATIONS",
            "NAT GENET" = "NATURE GENETICS",
            "NAT IMMUNOL" = "NATURE IMMUNOLOGY",
            "NAT METH.*" = "NATURE METHODS",
            "NPG SYST BIOL APPL" = "NPG SYSTEMS BIOLOGY AND APPLICATIONS",
            " GKV873" = "",
            "NUCL ACIDS RES$" = "NUCLEIC ACIDS RESEARCH",
            "NUCLEIC ACIDS RES$" = "NUCLEIC ACIDS RESEARCH",
            "DATABASE ISSUE" = "",
            "OXFORD BIOINFORMATICS" = "BIOINFORMATICS",
            "PLOS COMPUT BIOL" = "PLOS COMPUTATIONAL BIOLOGY",
            "PLOS COMPUTAT BIOL" = "PLOS COMPUTATIONAL BIOLOGY",
            "PROC NATL ACAD SCI.*" = "PNAS",
            "PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES.*" = "PNAS",
            "STAT APPL GENET MOL BIOL" = "STATISTICAL APPLICATIONS IN GENETICS AND MOLECULAR BIOLOGY"
        )
        ),
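        journal = str_squish(journal)
    ) |>
    # Drop entries that only became empty or non-journals after standardization
    filter(!journal %in% to_remove)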

# Taking a look at the first rows
head(journal_df_clean)
             journal           package
1     BIOINFORMATICS        cytomapper
2 SCIENTIFIC REPORTS      IsoCorrectoR
3     BIOINFORMATICS             Rtpca
4     BIOINFORMATICS transcriptogramer
5     BIOINFORMATICS               ACE
6     BIOINFORMATICS          limmaGUI
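A note on how the named vector in str_replace_all() behaves: each name is a regular expression and each value is its replacement, applied one after the other. When an abbreviation is a prefix of the full name, the pattern needs an anchor ($) or a trailing .* to avoid leaving leftovers. The toy vectors below are made up to illustrate this.

```r
library(stringr)

# Made-up journal names as they might look after the first cleaning pass
toy <- c("NAT METHODS", "NAT METH", "GENOME BIOL", "GENOME BIOLOGY")

# Names are regex patterns, values are replacements
rules <- c(
    "NAT METH.*"   = "NATURE METHODS",   # '.*' consumes the rest of the string
    "GENOME BIOL$" = "GENOME BIOLOGY"    # '$' matches only at the end
)

fixed <- str_replace_all(toy, rules)
fixed

# Without '.*' or '$', only the matched prefix is replaced and the remaining
# characters are kept
leftover <- str_replace_all("NAT METHODS", c("NAT METH" = "NATURE METHODS"))
leftover
```

The first call maps both spellings of each journal to a single name, while the last call returns NATURE METHODSODS, which is the kind of artifact worth checking for with sort(unique(journal_df_clean$journal)).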

The final data frame, journal_df_clean, lists each package alongside the journal where its paper was published.

Summary stats

Now, let’s count the number of packages per journal and show the top 20 journals based on the number of papers associated with Bioc packages.

# Get top 20 journals in number of papers associated with Bioc pkgs
citation_stats <- journal_df_clean |>
    count(journal) |>
    arrange(desc(n)) |>
    slice_head(n = 20)

citation_stats
                                                      journal   n
1                                              BIOINFORMATICS 303
2                                          BMC BIOINFORMATICS 101
3                                      NUCLEIC ACIDS RESEARCH  71
4                                              GENOME BIOLOGY  61
5                                               F1000RESEARCH  34
6                                              NATURE METHODS  30
7                                                BMC GENOMICS  25
8                                       NATURE COMMUNICATIONS  24
9                                  PLOS COMPUTATIONAL BIOLOGY  22
10                                                   PLOS ONE  22
11                                            GENOME RESEARCH  13
12                                       ANALYTICAL CHEMISTRY  11
13                                              BIOSTATISTICS  10
14                                BRIEFINGS IN BIOINFORMATICS   9
15                                       NATURE BIOTECHNOLOGY   9
16                               JOURNAL OF PROTEOME RESEARCH   8
17                                  MOLECULAR SYSTEMS BIOLOGY   8
18                                            NATURE GENETICS   8
19                                                       PNAS   8
20 STATISTICAL APPLICATIONS IN GENETICS AND MOLECULAR BIOLOGY   7
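If you are considering a specific journal, the same data frame can also be queried the other way around, to list the packages whose papers appeared there. Here is a minimal sketch on a toy data frame; the package names are placeholders, and with the real data you would use journal_df_clean instead.

```r
library(dplyr)

# Toy stand-in for journal_df_clean; package names are placeholders
toy <- data.frame(
    journal = c("BIOINFORMATICS", "BIOINFORMATICS", "GENOME BIOLOGY",
                "BIOINFORMATICS", "PLOS ONE", "GENOME BIOLOGY"),
    package = paste0("pkg", 1:6)
)

# Papers per journal, most frequent first
toy_stats <- toy |>
    count(journal, sort = TRUE)
toy_stats

# Packages whose paper was published in a journal of interest
genome_biology_pkgs <- toy |>
    filter(journal == "GENOME BIOLOGY") |>
    pull(package)
genome_biology_pkgs
```

Browsing the papers of packages similar to yours can give a feel for the scope and format a journal expects.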

Exploring it visually:

# Read figure with Bioc logo
bioc_logo <- png::readPNG(
    here::here("blog", "2022-01-03-bioc_publications", "featured-bioc.png"), 
    native = TRUE
)

# Define plotting params
last_updated <- format(Sys.Date(), "%Y-%m-%d")
xmax <- max(citation_stats$n) + 30
xmax <- round(xmax / 10) * 10

# Plot data
ggplot(citation_stats, aes(x = n, y = reorder(journal, n))) +
    geom_col() +
    geom_text(aes(label = n), hjust = -0.3) +
    xlim(0, xmax) +
    labs(
        title = "Where are papers associated with BioC packages published?",
        subtitle = paste0("Last update: ", last_updated),
        x = "Number of papers", y = ""
    ) +
    theme_bw() +
    patchwork::inset_element(
        bioc_logo,
        left = 0.5,
        top = 0.55,
        right = 0.95,
        bottom = 0.3
    ) +
    theme_void()

And voilà! In case you want to explore the whole table, it is stored in journal_df_clean.
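One way to browse the whole table interactively is the DT package, which appears in the session information below. This is a sketch under assumptions: the options are illustrative choices, and the toy data frame stands in for journal_df_clean.

```r
library(DT)

# Toy stand-in for journal_df_clean; package names are placeholders
toy <- data.frame(
    journal = c("BIOINFORMATICS", "GENOME BIOLOGY", "PLOS ONE"),
    package = c("pkgA", "pkgB", "pkgC")
)

# Searchable, sortable HTML table with per-column filters
tbl <- datatable(
    toy,
    rownames = FALSE,
    filter = "top",
    options = list(pageLength = 10)
)
tbl
```

In an R Markdown or Quarto document rendered to HTML, printing the object embeds the widget in the page; in an interactive session it opens in the viewer.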

Session information

This post was created under the following conditions:

─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.3.0 (2023-04-21)
 os       Ubuntu 20.04.5 LTS
 system   x86_64, linux-gnu
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       Europe/Brussels
 date     2023-06-13
 pandoc   3.1.1 @ /usr/lib/rstudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)

─ Packages ───────────────────────────────────────────────────────────────────
 package     * version date (UTC) lib source
 BiocManager   1.30.20 2023-02-24 [1] CRAN (R 4.3.0)
 BiocStyle     2.28.0  2023-04-25 [1] Bioconductor
 bslib         0.4.2   2022-12-16 [1] CRAN (R 4.3.0)
 cachem        1.0.8   2023-05-01 [1] CRAN (R 4.3.0)
 cli           3.6.1   2023-03-23 [1] CRAN (R 4.3.0)
 colorspace    2.1-0   2023-01-23 [1] CRAN (R 4.3.0)
 crosstalk     1.2.0   2021-11-04 [1] CRAN (R 4.3.0)
 curl          5.0.0   2023-01-12 [1] CRAN (R 4.3.0)
 digest        0.6.31  2022-12-11 [1] CRAN (R 4.3.0)
 dplyr       * 1.1.2   2023-04-20 [1] CRAN (R 4.3.0)
 DT            0.27    2023-01-17 [1] CRAN (R 4.3.0)
 ellipsis      0.3.2   2021-04-29 [1] CRAN (R 4.3.0)
 evaluate      0.20    2023-01-17 [1] CRAN (R 4.3.0)
 fansi         1.0.4   2023-01-22 [1] CRAN (R 4.3.0)
 farver        2.1.1   2022-07-06 [1] CRAN (R 4.3.0)
 fastmap       1.1.1   2023-02-24 [1] CRAN (R 4.3.0)
 forcats     * 1.0.0   2023-01-29 [1] CRAN (R 4.3.0)
 generics      0.1.3   2022-07-05 [1] CRAN (R 4.3.0)
 ggplot2     * 3.4.2   2023-04-03 [1] CRAN (R 4.3.0)
 glue          1.6.2   2022-02-24 [1] CRAN (R 4.3.0)
 gtable        0.3.3   2023-03-21 [1] CRAN (R 4.3.0)
 here          1.0.1   2020-12-13 [1] CRAN (R 4.3.0)
 hms           1.1.3   2023-03-21 [1] CRAN (R 4.3.0)
 htmltools     0.5.5   2023-03-23 [1] CRAN (R 4.3.0)
 htmlwidgets   1.6.2   2023-03-17 [1] CRAN (R 4.3.0)
 httr          1.4.5   2023-02-24 [1] CRAN (R 4.3.0)
 jquerylib     0.1.4   2021-04-26 [1] CRAN (R 4.3.0)
 jsonlite      1.8.4   2022-12-06 [1] CRAN (R 4.3.0)
 knitr         1.42    2023-01-25 [1] CRAN (R 4.3.0)
 labeling      0.4.2   2020-10-20 [1] CRAN (R 4.3.0)
 lifecycle     1.0.3   2022-10-07 [1] CRAN (R 4.3.0)
 lubridate   * 1.9.2   2023-02-10 [1] CRAN (R 4.3.0)
 magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.3.0)
 munsell       0.5.0   2018-06-12 [1] CRAN (R 4.3.0)
 patchwork     1.1.2   2022-08-19 [1] CRAN (R 4.3.0)
 pillar        1.9.0   2023-03-22 [1] CRAN (R 4.3.0)
 pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.3.0)
 png           0.1-8   2022-11-29 [1] CRAN (R 4.3.0)
 purrr       * 1.0.1   2023-01-10 [1] CRAN (R 4.3.0)
 R6            2.5.1   2021-08-19 [1] CRAN (R 4.3.0)
 readr       * 2.1.4   2023-02-10 [1] CRAN (R 4.3.0)
 rlang         1.1.1   2023-04-28 [1] CRAN (R 4.3.0)
 rmarkdown     2.21    2023-03-26 [1] CRAN (R 4.3.0)
 rprojroot     2.0.3   2022-04-02 [1] CRAN (R 4.3.0)
 rstudioapi    0.14    2022-08-22 [1] CRAN (R 4.3.0)
 rvest       * 1.0.3   2022-08-19 [1] CRAN (R 4.3.0)
 sass          0.4.5   2023-01-24 [1] CRAN (R 4.3.0)
 scales        1.2.1   2022-08-20 [1] CRAN (R 4.3.0)
 sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.3.0)
 stringi       1.7.12  2023-01-11 [1] CRAN (R 4.3.0)
 stringr     * 1.5.0   2022-12-02 [1] CRAN (R 4.3.0)
 tibble      * 3.2.1   2023-03-20 [1] CRAN (R 4.3.0)
 tidyr       * 1.3.0   2023-01-24 [1] CRAN (R 4.3.0)
 tidyselect    1.2.0   2022-10-10 [1] CRAN (R 4.3.0)
 tidyverse   * 2.0.0   2023-02-22 [1] CRAN (R 4.3.0)
 timechange    0.2.0   2023-01-11 [1] CRAN (R 4.3.0)
 tzdb          0.3.0   2022-03-28 [1] CRAN (R 4.3.0)
 utf8          1.2.3   2023-01-31 [1] CRAN (R 4.3.0)
 vctrs         0.6.2   2023-04-19 [1] CRAN (R 4.3.0)
 withr         2.5.0   2022-03-03 [1] CRAN (R 4.3.0)
 xfun          0.39    2023-04-20 [1] CRAN (R 4.3.0)
 xml2          1.3.4   2023-04-27 [1] CRAN (R 4.3.0)
 yaml          2.3.7   2023-01-23 [1] CRAN (R 4.3.0)

 [1] /home/faalm/R/x86_64-pc-linux-gnu-library/4.3
 [2] /usr/local/lib/R/site-library
 [3] /usr/lib/R/site-library
 [4] /usr/lib/R/library

──────────────────────────────────────────────────────────────────────────────