This dataset contains INSDC sequence records that are not associated with environmental sample identifiers or host organisms. It is prepared periodically by querying the public ENA API (https://www.ebi.ac.uk/ena/portal/api/) with the search parameters `environmental_sample=False & host=""`.
EMBL-EBI also publishes other records in separate datasets (https://www.gbif.org/publisher/ada9d123-ddb4-467d-8891-806ea8d94230).
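As a rough illustration of the query described above, the sketch below only builds a search URL and sends no request. The base URL and the two search conditions come from the text. The `search` path, the `result`, `fields`, `format` and `limit` parameters, the `AND` operator, and the chosen field list are assumptions made for this example and may not match what the adapter actually sends.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Base URL as given in the dataset description.
ENA_PORTAL_API = "https://www.ebi.ac.uk/ena/portal/api/"


def build_search_url(limit=10):
    """Build a search URL (assumed parameter names; see the note above)."""
    params = {
        "result": "sequence",  # assumption: sequence-level records
        # The two conditions from the description, joined with AND (assumed syntax).
        "query": 'environmental_sample=false AND host=""',
        # Assumption: a small subset of fields, named as in the description.
        "fields": "accession,scientific_name,sample_accession,collection_date",
        "format": "tsv",
        "limit": limit,
    }
    return ENA_PORTAL_API + "search?" + urlencode(params)


url = build_search_url()
print(url)
# Decode the query string again to check that the search conditions survive encoding.
print(parse_qs(urlsplit(url).query)["query"][0])
```

Building the URL with `urlencode` avoids hand-escaping the quotes and the equals signs inside the `query` value.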
The data was then processed as follows:
1. Human sequences were excluded.
2. For non-CONTIG records, the sample accession number (when available) and the scientific name were used to identify sequence records corresponding to the same individual (or group of organisms of the same species in the same sample). Only one record was kept for each combination of scientific name and sample accession number.
3. Contigs and whole genome shotgun (WGS) records were added individually.
4. Records with incomplete information were excluded. Only records associated with a specimen voucher, or records containing both a location AND a date, were kept.
5. Records associated with the same voucher were aggregated together.
6. Many of the remaining records were individual sequences or reads from the same organisms. In practice, these were "duplicate" occurrence records that were not filtered out in step 2 because the sample accession was missing. To identify these potential duplicates, we grouped all the remaining records by `scientific_name`, `collection_date`, `location`, `country`, `identified_by`, `collected_by` and `sample_accession` (when available), then excluded the groups that contained more than 50 records. The rationale behind the choice of threshold is explained here: https://github.com/gbif/embl-adapter/issues/10#issuecomment-855757978
7. To improve the matching of the EBI scientific names to the GBIF backbone taxonomy, we incorporated the ENA taxonomic information. The kingdom, phylum, class, order, family, and genus were obtained from the ENA taxonomy checklist available here: http://ftp.ebi.ac.uk/pub/databases/ena/taxonomy/sdwca.zip
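The filtering steps above can be sketched roughly as follows. This is a simplified illustration, not the adapter's actual code. The grouping fields and the threshold of 50 come from the text. The `is_contig` and `specimen_voucher` keys, and the test for human records by scientific name, are hypothetical. Voucher aggregation (step 5) and taxonomy enrichment (step 7) are left out.

```python
from collections import defaultdict

MAX_GROUP_SIZE = 50  # threshold stated in step 6

# Grouping fields listed in step 6.
GROUP_FIELDS = ("scientific_name", "collection_date", "location", "country",
                "identified_by", "collected_by", "sample_accession")


def is_human(rec):
    # Assumption: human sequences are recognised by their scientific name.
    return rec.get("scientific_name") == "Homo sapiens"


def dedupe_by_sample(records):
    """Step 2: keep one non-contig record per (scientific name, sample accession)."""
    seen, kept = set(), []
    for rec in records:
        sample = rec.get("sample_accession")
        if rec.get("is_contig") or not sample:
            # Contigs/WGS (step 3) and records without a sample accession pass through.
            kept.append(rec)
            continue
        key = (rec.get("scientific_name"), sample)
        if key not in seen:
            seen.add(key)
            kept.append(rec)
    return kept


def has_required_info(rec):
    """Step 4: a specimen voucher, or both a location and a date."""
    has_voucher = bool(rec.get("specimen_voucher"))
    has_place_and_date = bool(rec.get("location") and rec.get("collection_date"))
    return has_voucher or has_place_and_date


def drop_large_groups(records, max_size=MAX_GROUP_SIZE):
    """Step 6: exclude every group holding more than max_size records."""
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec.get(f) for f in GROUP_FIELDS)].append(rec)
    # Output is ordered group by group, not in the original record order.
    return [rec for grp in groups.values() if len(grp) <= max_size for rec in grp]


def process(records):
    records = [r for r in records if not is_human(r)]       # step 1
    records = dedupe_by_sample(records)                     # steps 2 and 3
    records = [r for r in records if has_required_info(r)]  # step 4
    return drop_large_groups(records)                       # step 6


# Made-up records for illustration only.
sample = [
    {"scientific_name": "Homo sapiens", "location": "X", "collection_date": "2020-01-01"},
    {"scientific_name": "Taxon A", "sample_accession": "S1", "location": "X", "collection_date": "2020-01-01"},
    {"scientific_name": "Taxon A", "sample_accession": "S1", "location": "X", "collection_date": "2020-01-01"},
    {"scientific_name": "Taxon B", "location": "X"},  # no date and no voucher
]
print(len(process(sample)))  # 1
```

Missing values are kept as `None` in the grouping key, so records without a `sample_accession` can still fall into the same group. That matches the "when available" wording of step 6.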
More information is available here: https://github.com/gbif/embl-adapter#readme
The mapping used to convert the EMBL data to a Darwin Core Archive is available here: https://github.com/gbif/embl-adapter/blob/master/DATAMAPPING.md
GBIF URL: https://www.gbif.org/dataset/d8cd16ba-bb74-4420-821e-083f2bac17c2
Citation: European Bioinformatics Institute (EMBL-EBI), GBIF Helpdesk (2024). INSDC Sequences. Version 1.82. European Nucleotide Archive (EMBL-EBI). Occurrence dataset https://doi.org/10.15468/sbmztx accessed via GBIF.org on 2024-04-29.
| Species | Originally recorded name | Taxon group | Date | Administrative region | Dataset |
|---|---|---|---|---|---|
| Rondibilis horienis horienis 埔里刺翅天牛 | Rondibilis horienis horienis | Beetles | 2008-05-26 | | INSDC Sequences |
| Anneslea 茶梨屬 | Anneslea fragrans var. lanceolata | Angiosperms | | | INSDC Sequences |
| Haemadipsa rjukjuana 琉球山蛭 | Haemadipsa rjukjuana | Other invertebrates | 2004-09-20 | | INSDC Sequences |
| Woodwardia unigemmata (Makino) Nakai 頂芽狗脊蕨 | Woodwardia unigemmata | Ferns | | | INSDC Sequences |
| | Symmetrospora marina | | 2010-04-24 | | INSDC Sequences |
| Thunnus orientalis 太平洋黑鮪 | Thunnus orientalis | Fishes | | | INSDC Sequences |
| Marchantia paleacea Bertol. 粗裂地錢 | Marchantia paleacea | Bryophytes | | | INSDC Sequences |
| Machilus japonicus Siebold & Zucc. 假長葉楠 | Machilus japonica var. kusanoi | Angiosperms | | | INSDC Sequences |
| Heptapleurum heptaphyllum (L.) Y.F.Deng 鵝掌柴 | Heptapleurum heptaphyllum | Angiosperms | | | INSDC Sequences |
| Podocarpus nakaii Hayata 桃實百日青 | Podocarpus nakaii | Gymnosperms | | | INSDC Sequences |
| | Entomortierella jenkinii | | 2014-02-23 | | INSDC Sequences |
| Begonia 秋海棠屬 | Begonia x chungii | Angiosperms | | | INSDC Sequences |
| Debregeasia orientalis C.J.Chen 水麻 | Debregeasia orientalis | Angiosperms | | | INSDC Sequences |
| | Proteus sp. AS1 | | 2014-10-09 | | INSDC Sequences |
| Nebria niitakana | Nebria niitakana | Beetles | 2011-08-03 | | INSDC Sequences |
| Leistus | Leistus sp. 2 YMW-2015 | Beetles | 2011-12-15 | | INSDC Sequences |
| Cynoglossus interruptus 斷線舌鰨 | Cynoglossus interruptus | Fishes | | | INSDC Sequences |
| Champsodon snyderi 斯氏鱷齒魚 | Champsodon snyderi | Fishes | | | INSDC Sequences |
| | Mammalian orthoreovirus 1 | | | | INSDC Sequences |
| Phytophthora 疫黴屬 | Phytophthora sp. occultans-like | Chromists | 2013-01-01 | | INSDC Sequences |
| Aegus nakaneorum 姬肥角鍬形蟲 | Aegus nakaneorum | Beetles | | | INSDC Sequences |
| Afronurus | Afronurus sp. AsH5 | Other insects | 2017-01-01 | | INSDC Sequences |
| | Parahelice daviei | | 2016-05-24 | | INSDC Sequences |
| Conocephalum conicum (L.) Dumort. 蛇蘚 | Conocephalum conicum | Bryophytes | 2018-01-01 | | INSDC Sequences |
| | Dengue virus | | 2002-01-01 | | INSDC Sequences |