This function checks a SHARK data frame against the required and recommended
fields defined for a specific datatype. It verifies that all required fields
are present and contain non-empty values. If level = "warning", it
also checks for recommended fields and empty values within them.
Usage
check_fields(
  data,
  datatype,
  level = "error",
  stars = 1,
  bacterioplankton_subtype = "abundance",
  field_definitions = .field_definitions
)
Arguments
- data
A data frame containing SHARK data to be validated.
- datatype
A string giving the SHARK datatype to validate against. Must exist as a name in the provided field_definitions.
- level
Character string, either "error" or "warning". If "error", only required fields are validated. If "warning", recommended fields are also checked and reported as warnings.
- stars
Integer. Maximum number of "*" levels to include. Default = 1 (only single "*"). For example, stars = 2 includes "*" and "**", and stars = 3 includes "*", "**", and "***".
- bacterioplankton_subtype
Character. For "Bacterioplankton" only: either "abundance" (default) or "production". Ignored for other datatypes.
- field_definitions
A named list of field definitions. Each element should contain two character vectors: required and recommended. Defaults to the package's built-in SHARK4R:::.field_definitions. Alternatively, the latest definitions can be loaded directly from the official SHARK4R GitHub repository using load_shark4r_fields().
Value
A tibble with the following columns:
- level
Either "error" or "warning".
- field
The name of the field that triggered the check.
- row
Row number(s) in data where the issue occurred, or NA if the whole field is missing.
- message
A descriptive message explaining the problem.
The tibble will be empty if no problems are found.
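To make the column semantics concrete, here is a minimal base-R sketch that builds a result of the same shape for a toy data frame. It is not the package's implementation: the helper name check_required_sketch is hypothetical, the wording of the empty-value message is assumed, and reporting one result row per offending data row is also an assumption.

```r
# Hypothetical helper, not part of SHARK4R: mimics the documented result shape.
check_required_sketch <- function(data, required) {
  issues <- list()
  for (field in required) {
    if (!field %in% names(data)) {
      # Whole field is missing, so `row` is NA
      issues[[length(issues) + 1]] <- data.frame(
        level = "error", field = field, row = NA_integer_,
        message = sprintf("Required field %s is missing", field),
        stringsAsFactors = FALSE
      )
    } else {
      values <- data[[field]]
      # Treat NA and blank strings as empty
      empty_rows <- which(is.na(values) | trimws(as.character(values)) == "")
      if (length(empty_rows) > 0) {
        issues[[length(issues) + 1]] <- data.frame(
          level = "error", field = field, row = empty_rows,
          # Assumed wording; the real message may differ
          message = sprintf("Required field %s has an empty value", field),
          stringsAsFactors = FALSE
        )
      }
    }
  }
  if (length(issues) == 0) {
    # No problems: return an empty result with the same columns
    return(data.frame(level = character(), field = character(),
                      row = integer(), message = character(),
                      stringsAsFactors = FALSE))
  }
  do.call(rbind, issues)
}

toy <- data.frame(id = c(1, 2, 3), value = c("x", "", NA),
                  stringsAsFactors = FALSE)
result <- check_required_sketch(toy, required = c("id", "value", "station_name"))
print(result)
```

In this toy case, value is flagged on rows 2 and 3, while station_name is flagged once with row set to NA because the column does not exist at all.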
Details
Note: A single "*" marks required fields in the standard SHARK template. A double "**" is often used to specify columns required for national monitoring only. For more information, see: https://www.smhi.se/data/hav-och-havsmiljo/datavardskap-oceanografi-och-marinbiologi/leverera-data
Field definitions for SHARK data can be loaded in two ways:
From the SHARK4R package bundle (default): The package contains a built-in object,
.field_definitions, which stores required and recommended fields for each datatype.From GitHub (latest official version): To use the most up-to-date field definitions, you can load them directly from the SHARK4R-statistics repository:
defs <- load_shark4r_fields() check_fields(my_data, "Phytoplankton", field_definitions = defs)
Delivery-format (all-caps) data:
If the column names in data are all uppercase (e.g. SDATE), check_fields() assumes
the dataset follows the official SHARK delivery template. In this case:
- Required fields are determined from the delivery template using get_delivery_template() and find_required_fields().
- Recommended fields are ignored because the delivery templates do not define them.
- The function validates that all required columns exist and contain non-empty values.
This ensures that both internal SHARK4R datasets (with camelCase or snake_case columns)
and official delivery files (ALL_CAPS columns) are validated correctly using the appropriate rules.
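The all-caps rule can be illustrated with a short base-R sketch. The helper name is_delivery_format_sketch is hypothetical and EXAMPLE_COL is a made-up column; the detection logic inside check_fields() may differ in detail, for example in how it treats digits or underscores.

```r
# Hypothetical helper: TRUE when every column name is already upper case.
is_delivery_format_sketch <- function(data) {
  cols <- names(data)
  length(cols) > 0 && all(cols == toupper(cols))
}

delivery_like <- data.frame(SDATE = "2023-06-01", EXAMPLE_COL = "A")
internal_like <- data.frame(visit_date = "2023-06-01", sample_id = "S1")

is_delivery_format_sketch(delivery_like)  # TRUE
is_delivery_format_sketch(internal_like)  # FALSE
```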
Stars in the template
Leading asterisks in the delivery template indicate required levels:
* = standard required column
** = required for national monitoring
Other symbols = additional requirement level
The stars parameter in check_fields() controls how many levels of required
columns to include.
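As a rough illustration of how a stars threshold could select required columns, the sketch below counts leading asterisks in a vector of template headers. The helper name required_by_stars_sketch and the COL_* headers are hypothetical, and it assumes the asterisks are a prefix of the header text; find_required_fields() may parse the template differently.

```r
# Hypothetical helper: keep headers with between 1 and `stars` leading asterisks.
required_by_stars_sketch <- function(headers, stars = 1) {
  stripped <- sub("^\\*+", "", headers)        # header without leading asterisks
  n_stars <- nchar(headers) - nchar(stripped)  # number of asterisks removed
  stripped[n_stars >= 1 & n_stars <= stars]
}

# Made-up template headers for illustration only
headers <- c("*SDATE", "**COL_B", "***COL_C", "COL_D")

required_by_stars_sketch(headers, stars = 1)  # "SDATE"
required_by_stars_sketch(headers, stars = 2)  # "SDATE" "COL_B"
```

Headers without any asterisk are never selected, whatever value stars takes.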
See also
load_shark4r_fields for fetching the latest field definitions from GitHub,
get_delivery_template for downloading delivery templates from SMHI's website.
Examples
# Example 1: Using built-in field definitions for "Phytoplankton"
df_phyto <- data.frame(
  visit_date = "2023-06-01",
  sample_id = "S1",
  scientific_name = "Skeletonema marinoi",
  value = 123
)
check_fields(df_phyto, "Phytoplankton", level = "warning")
#> # A tibble: 42 × 4
#> level field row message
#> <chr> <chr> <lgl> <chr>
#> 1 error visit_year NA Required field visit_year is missing
#> 2 error station_name NA Required field station_name is missing
#> 3 error reported_station_name NA Required field reported_station_name is m…
#> 4 error sample_project_name_sv NA Required field sample_project_name_sv is …
#> 5 error sample_orderer_name_sv NA Required field sample_orderer_name_sv is …
#> 6 error platform_code NA Required field platform_code is missing
#> 7 error sample_date NA Required field sample_date is missing
#> 8 error sample_time NA Required field sample_time is missing
#> 9 error sample_latitude_dd NA Required field sample_latitude_dd is miss…
#> 10 error sample_longitude_dd NA Required field sample_longitude_dd is mis…
#> # ℹ 32 more rows
# Example 2: Load latest definitions from GitHub and use them
if (FALSE) { # \dontrun{
  defs <- load_shark4r_fields()
  check_fields(df_phyto, "Phytoplankton", field_definitions = defs)
} # }
# Example 3: Custom datatype with required + recommended fields
defs <- list(
  ExampleType = list(
    required = c("id", "value"),
    recommended = "comment"
  )
)
df_ok <- data.frame(id = 1, value = "x", comment = "ok")
check_fields(df_ok, "ExampleType", level = "warning", field_definitions = defs)
#> # A tibble: 0 × 0
