This dataset contains INSDC sequence records not associated with environmental sample identifiers or host organisms. The dataset is prepared periodically using the public ENA API (https://www.ebi.ac.uk/ena/portal/api/) by querying data with search parameters: `environmental_sample=False & host=""`
EMBL-EBI also publishes other records in separate datasets (https://www.gbif.org/publisher/ada9d123-ddb4-467d-8891-806ea8d94230).
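As a rough illustration of the query described above, the sketch below only builds a request URL and does not contact the server. The `search` endpoint, the `result=sequence` value, the `accession` field and the `build_search_url` helper are assumptions made for this example. The query string is copied verbatim from the description, and the exact syntax sent by the adapter may differ.

```python
from urllib.parse import urlencode

# Base URL taken from the dataset description.
ENA_PORTAL_API = "https://www.ebi.ac.uk/ena/portal/api/"


def build_search_url(limit=10):
    """Build (but do not send) a search URL for the ENA portal API.

    Hypothetical helper: the endpoint name and the 'result' value are
    assumptions, not confirmed by the dataset description.
    """
    params = {
        "result": "sequence",  # assumed result type
        # Search parameters copied verbatim from the description above.
        "query": 'environmental_sample=False & host=""',
        # Field names mentioned in the processing steps, plus 'accession' (assumed).
        "fields": "accession,scientific_name,collection_date,location,country,sample_accession",
        "format": "tsv",
        "limit": limit,
    }
    return ENA_PORTAL_API + "search?" + urlencode(params)


print(build_search_url())
```

Building the URL separately from sending it makes it easy to inspect the encoded query before running a large download.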
The data was then processed as follows:
1. Human sequences were excluded.
2. For non-CONTIG records, the sample accession number (when available) and the scientific name were used to identify sequence records corresponding to the same individual (or to a group of organisms of the same species in the same sample). Only one record was kept for each combination of scientific name and sample accession number.
3. Contigs and whole genome shotgun (WGS) records were added individually.
4. The records that were missing some information were excluded. Only records associated with a specimen voucher or records containing both a location AND a date were kept.
5. The records associated with the same vouchers were aggregated together.
6. Many of the remaining records were individual sequences or reads from the same organisms. In practice, these were "duplicate" occurrence records that were not filtered out in step 2 because the sample accession number was missing. To identify these potential duplicates, we grouped all the remaining records by `scientific_name`, `collection_date`, `location`, `country`, `identified_by`, `collected_by` and `sample_accession` (when available). We then excluded the groups that contained more than 50 records. The rationale for this threshold is explained here: https://github.com/gbif/embl-adapter/issues/10#issuecomment-855757978
7. To improve the matching of the EBI scientific name to the GBIF backbone taxonomy, we incorporated the ENA taxonomic information. The kingdom, phylum, class, order, family, and genus were obtained from the ENA taxonomy checklist available here: http://ftp.ebi.ac.uk/pub/databases/ena/taxonomy/sdwca.zip
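The filtering logic in steps 1, 2, 4 and 6 can be sketched in a few lines. This is a minimal illustration and not the adapter's actual implementation: records are assumed to be plain dictionaries, the `specimen_voucher` key and all function names are hypothetical, and human sequences are assumed to be identifiable by the scientific name `Homo sapiens`. The separate handling of contigs and WGS records (step 3), voucher aggregation (step 5) and taxonomy enrichment (step 7) are omitted.

```python
from collections import defaultdict

MAX_GROUP_SIZE = 50  # threshold stated in step 6

# Grouping fields listed in step 6.
GROUP_KEYS = ("scientific_name", "collection_date", "location", "country",
              "identified_by", "collected_by", "sample_accession")


def is_human(rec):
    """Step 1 (assumed test): human sequences carry this scientific name."""
    return rec.get("scientific_name") == "Homo sapiens"


def dedupe_by_sample(records):
    """Step 2: keep one record per (scientific name, sample accession)."""
    seen, kept = set(), []
    for rec in records:
        sample = rec.get("sample_accession")
        if not sample:
            kept.append(rec)  # no accession: cannot deduplicate here
            continue
        key = (rec.get("scientific_name"), sample)
        if key not in seen:
            seen.add(key)
            kept.append(rec)
    return kept


def has_required_info(rec):
    """Step 4: a voucher, or both a location and a date."""
    if rec.get("specimen_voucher"):
        return True
    return bool(rec.get("location")) and bool(rec.get("collection_date"))


def drop_large_groups(records, max_size=MAX_GROUP_SIZE):
    """Step 6: exclude groups containing more than max_size records."""
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec.get(k) or "" for k in GROUP_KEYS)].append(rec)
    return [rec for g in groups.values() if len(g) <= max_size for rec in g]


def process(records):
    records = [r for r in records if not is_human(r)]
    records = dedupe_by_sample(records)
    records = [r for r in records if has_required_info(r)]
    return drop_large_groups(records)
```

Missing grouping fields are treated as empty strings here, so records lacking a `sample_accession` still fall into the same group when their other fields match.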
More information is available here: https://github.com/gbif/embl-adapter#readme
You can find the mapping used to format the EMBL data to Darwin Core Archive here: https://github.com/gbif/embl-adapter/blob/master/DATAMAPPING.md
GBIF url: https://www.gbif.org/dataset/d8cd16ba-bb74-4420-821e-083f2bac17c2
Citation: European Bioinformatics Institute (EMBL-EBI), GBIF Helpdesk (2024). INSDC Sequences. Version 1.82. European Nucleotide Archive (EMBL-EBI). Occurrence dataset https://doi.org/10.15468/sbmztx accessed via GBIF.org on 2024-04-29.
Species | Originally recorded species | Taxon group | Date | Administrative region | Dataset
---|---|---|---|---|---
Haemadipsa picta 彩紋山蛭 | Haemadipsa picta | Other invertebrates | 2006-10-15 | | INSDC Sequences
Termitomyces 蟻傘屬 | Termitomyces sp. T984 | Fungi | 2009-06-27 | | INSDC Sequences
Saccostrea 囊牡蠣屬 | Saccostrea sp. 9 STH-2012 | Snails and shellfish | 2009-02-15 | | INSDC Sequences
Drosophila 果蠅屬 | Drosophila lucipennis | Other insects | 2012-01-01 | | INSDC Sequences
Macrothele taiwanensis 臺灣長尾蛛 | Macrothele taiwanensis | Spiders | | | INSDC Sequences
Helicia formosana Hemsl. 山龍眼 | Helicia formosana | Angiosperms | | | INSDC Sequences
Reevesia formosana Sprague 臺灣梭羅樹 | Reevesia formosana | Angiosperms | | | INSDC Sequences
Turpinia ternata Nakai 三葉山香圓 | Turpinia ternata | Angiosperms | | | INSDC Sequences
Submyotodon latirostris 寬吻鼠耳蝠 | Submyotodon latirostris | Mammals | | | INSDC Sequences
Begonia nantoensis M.J.Lai & N.J.Chung 南投秋海棠 | Begonia nantoensis | Angiosperms | | | INSDC Sequences
Leistus nokoensis | Leistus nokoensis | Beetles | 2010-07-20 | | INSDC Sequences
Leistus nokoensis | Leistus nokoensis | Beetles | 2011-04-08 | | INSDC Sequences
Strophidon sathete 長鯙 | Strophidon sathete | Fishes | | | INSDC Sequences
Auxis rochei; Auxis rochei rochei 圓花鰹; 圓花鰹 | Auxis rochei | Fishes | | | INSDC Sequences
Gentiana arisanensis Hayata 阿里山龍膽 | Gentiana arisanensis | Angiosperms | | | INSDC Sequences
Fusarium falciforme | Fusarium falciforme | Fungi | | | INSDC Sequences
Mazus goodenifolius (Hornem.) Pennell 薄葉通泉草 | Mazus goodenifolius | Angiosperms | | | INSDC Sequences
Neolucanus 圓翅鍬形蟲屬 | Neolucanus maximus vendli | Beetles | | | INSDC Sequences
Mandarella flaviventris 黃腹瘦葉蚤 | Mandarella flaviventris | Beetles | | | INSDC Sequences
Leptarma kui | Leptarma kui | Shrimps and crabs | 2015-09-30 | | INSDC Sequences
Metaplax 長方蟹屬 | Metaplax shenii | Shrimps and crabs | 2018-06-29 | | INSDC Sequences
Spodoptera 夜盜蛾屬 | Spodoptera frugiperda | Moths | | | INSDC Sequences
Cacopsylla chinensis 中國梨木蝨 | Cacopsylla chinensis | Other insects | | | INSDC Sequences
Fusarium 鐮孢菌屬 | Fusarium mangiferae | Fungi | 2016-01-01 | | INSDC Sequences
Gnetum 買麻藤屬 | Gnetum hainanense | Gymnosperms | 2020-04-01 | | INSDC Sequences