Skip to contents

This function processes a character vector of scientific names, splitting them into genus and species components. It handles binomial names (e.g., "Homo sapiens"), removes undesired descriptors (e.g., 'Cfr.', 'cf.', 'sp.', 'spp.'), and manages cases involving varieties, subspecies, or invalid species names. Special characters and whitespace are handled appropriately.

Usage

parse_scientific_names(
  scientific_name,
  remove_undesired_descriptors = TRUE,
  remove_subspecies = TRUE,
  remove_invalid_species = TRUE,
  encoding = "UTF-8"
)

Arguments

scientific_name

A character vector containing scientific names, which may include binomials, additional descriptors, or varieties.

remove_undesired_descriptors

Logical, if TRUE, undesired descriptors (e.g., 'Cfr.', 'cf.', 'colony', 'cells', etc.) are removed. Default is TRUE.

remove_subspecies

Logical, if TRUE, subspecies/variety descriptors (e.g., 'var.', 'subsp.', 'f.', etc.) are removed. Default is TRUE.

remove_invalid_species

Logical, if TRUE, invalid species names (e.g., 'sp.', 'spp.') are removed. Default is TRUE.

encoding

A string specifying the encoding to be used for the input names (e.g., 'UTF-8'). Default is 'UTF-8'.

Value

A data.frame with two columns:

  • genus: Contains the genus names.

  • species: Contains the species names (empty if unavailable or invalid). Invalid descriptors like 'sp.', 'spp.', and numeric entries are excluded from the 'species' column.

Examples

# Example with a vector of scientific names
scientific_names <- c("Skeletonema marinoi", "Cf. Azadinium perforatum", "Gymnodinium sp.",
                      "Melosira varians", "Aulacoseira islandica var. subarctica")
result <- parse_scientific_names(scientific_names)

# Check the resulting data frame
print(result)
#>         genus              species
#> 1 Skeletonema              marinoi
#> 2   Azadinium           perforatum
#> 3 Gymnodinium                     
#> 4    Melosira              varians
#> 5 Aulacoseira islandica subarctica