This dataset contains INSDC sequence records not associated with environmental sample identifiers or host organisms. The dataset is prepared periodically using the public ENA API (https://www.ebi.ac.uk/ena/portal/api/) by querying data with search parameters: `environmental_sample=False & host=""`
EMBL-EBI also publishes other records in separate datasets (https://www.gbif.org/publisher/ada9d123-ddb4-467d-8891-806ea8d94230).
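As a rough illustration of the query described above, the sketch below only builds a request URL against the ENA Portal API base address and does not send it. The `search` endpoint, the `result=sequence` value, the field list, and the exact query syntax are assumptions made for this example; the adapter's real configuration may differ.

```python
from urllib.parse import urlencode

# Base URL quoted in the dataset description.
ENA_PORTAL_API = "https://www.ebi.ac.uk/ena/portal/api/"


def build_search_url(limit=10):
    """Build (but do not send) a search URL for the filter described above.

    Assumptions: the `search` endpoint, `result=sequence`, the field list
    and the query syntax are illustrative, not taken from the adapter.
    """
    params = {
        "result": "sequence",
        "query": 'environmental_sample=false AND host=""',
        "fields": "accession,scientific_name,collection_date,country",
        "format": "tsv",
        "limit": limit,
    }
    return ENA_PORTAL_API + "search?" + urlencode(params)


if __name__ == "__main__":
    print(build_search_url(limit=5))
```

Building the URL separately from fetching it makes it easy to inspect the query before running a large download.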
The data was then processed as follows:
1. Human sequences were excluded.
2. For non-CONTIG records, the sample accession number (when available) and the scientific name were used to identify sequence records corresponding to the same individual (or to a group of organisms of the same species in the same sample). Only one record was kept for each combination of scientific name and sample accession number.
3. Contigs and whole genome shotgun (WGS) records were added individually.
4. The records that were missing some information were excluded. Only records associated with a specimen voucher or records containing both a location AND a date were kept.
5. The records associated with the same vouchers were aggregated together.
6. Many of the remaining records were individual sequences or reads from the same organisms. In practice, these were "duplicate" occurrence records that were not filtered out in step 2 because the sample accession was missing. To identify these potential duplicates, we grouped all the remaining records by `scientific_name`, `collection_date`, `location`, `country`, `identified_by`, `collected_by` and `sample_accession` (when available). Then we excluded the groups that contained more than 50 records. The rationale behind the choice of threshold is explained here: https://github.com/gbif/embl-adapter/issues/10#issuecomment-855757978
7. To improve the matching of the EBI scientific name to the GBIF backbone taxonomy, we incorporated the ENA taxonomic information. The kingdom, phylum, class, order, family, and genus were obtained from the ENA taxonomy checklist available here: http://ftp.ebi.ac.uk/pub/databases/ena/taxonomy/sdwca.zip
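The filtering in steps 2, 4 and 6 can be sketched roughly as follows. This is a simplified illustration, not the adapter's actual code: records are assumed to be plain dictionaries, `specimen_voucher` is a hypothetical key name, the first record seen is the one kept in step 2 (the description does not say which one is kept), and the CONTIG/WGS handling of steps 2 and 3 and the voucher aggregation of step 5 are left out.

```python
from collections import defaultdict

MAX_GROUP_SIZE = 50  # threshold stated in step 6

# Grouping fields listed in step 6.
GROUP_KEYS = (
    "scientific_name", "collection_date", "location", "country",
    "identified_by", "collected_by", "sample_accession",
)


def keep_one_per_sample(records):
    """Step 2: keep one record per (scientific_name, sample_accession).

    Records without a sample accession cannot be matched, so all are kept.
    Assumption: the first record encountered is the one retained.
    """
    seen = set()
    kept = []
    for rec in records:
        accession = rec.get("sample_accession")
        if accession:
            key = (rec.get("scientific_name"), accession)
            if key in seen:
                continue
            seen.add(key)
        kept.append(rec)
    return kept


def is_complete(rec):
    """Step 4: a voucher, or both a location and a date, must be present."""
    has_voucher = bool(rec.get("specimen_voucher"))  # hypothetical key name
    has_place_and_date = bool(rec.get("location")) and bool(rec.get("collection_date"))
    return has_voucher or has_place_and_date


def drop_large_groups(records, max_size=MAX_GROUP_SIZE):
    """Step 6: exclude every group that holds more than max_size records."""
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec.get(k) for k in GROUP_KEYS)].append(rec)
    return [rec for group in groups.values() if len(group) <= max_size for rec in group]
```

A possible way to chain the helpers on a list of record dictionaries named `records`:

```python
deduplicated = keep_one_per_sample(records)
complete = [rec for rec in deduplicated if is_complete(rec)]
final = drop_large_groups(complete)
```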
More information is available here: https://github.com/gbif/embl-adapter#readme
You can find the mapping used to format the EMBL data to Darwin Core Archive here: https://github.com/gbif/embl-adapter/blob/master/DATAMAPPING.md
GBIF URL: https://www.gbif.org/dataset/d8cd16ba-bb74-4420-821e-083f2bac17c2
Citation: European Bioinformatics Institute (EMBL-EBI), GBIF Helpdesk (2024). INSDC Sequences. Version 1.82. European Nucleotide Archive (EMBL-EBI). Occurrence dataset https://doi.org/10.15468/sbmztx accessed via GBIF.org on 2024-04-29.
Species | Originally recorded species | Taxon group | Date | Administrative region | Dataset
---|---|---|---|---|---
Thrips tabaci 蔥薊馬 | Thrips tabaci | Other insects | | | INSDC Sequences
Eurya gnaphalocarpa Hayata 毛果柃木 | Eurya gnaphalocarpa | Angiosperms | | | INSDC Sequences
dengue virus type 3 | | | 1999-01-01 | | INSDC Sequences
Human immunodeficiency virus 1 | | | 2004-01-01 | | INSDC Sequences
Hylarana latouchii 拉都希氏赤蛙 | Hylarana latouchii | Amphibians | | | INSDC Sequences
Camphora micrantha (Hayata) Y.Yang, Bing Liu & Zhi Yang 冇樟 | Cinnamomum micranthum f. kanehirae | Angiosperms | 2005-01-01 | | INSDC Sequences
Anoplophora horsfieldi; Anoplophora horsfieldi taiwanensis 荷菲氏星天牛; 荷菲氏星天牛 | Anoplophora horsfieldi tonkinensis | Beetles | 2008-07-15 | | INSDC Sequences
Kajikia audax 紅肉旗魚 | Kajikia audax | Fishes | | | INSDC Sequences
Aleiodes | Aleiodes sp. ASQSP1023-10 | Other insects | 2005-05-01 | | INSDC Sequences
Chrysoporthe deuterocubensis | | | 2012-10-26 | | INSDC Sequences
Amynthas 遠環蚓屬 | Amynthas sp. 1 CHC-2013 | Other invertebrates | | | INSDC Sequences
Eurya chinensis R.Br. 米碎柃木 | Eurya chinensis | Angiosperms | | | INSDC Sequences
Pseudococcus 粉介殼蟲屬 | Pseudococcus dendrobiorum | Other insects | 2013-05-01 | | INSDC Sequences
Microphysogobio alticorpus 高身小鰾鮈 | Microphysogobio alticorpus | Fishes | 2009-08-06 | | INSDC Sequences
Dolicheulota formosensis 臺灣長蝸牛 | Dolicheulota formosensis | Snails and shellfish | 2010-06-27 | | INSDC Sequences
Choanephora circinans | Poitrasia circinans | Fungi | 2013-12-24 | | INSDC Sequences
Nebria uenoiana | Nebria uenoiana | Beetles | 2012-05-02 | | INSDC Sequences
Nebria formosana | Nebria formosana | Beetles | 2011-08-03 | | INSDC Sequences
Chauliodus 蝰魚屬 | Chauliodus sp. ASIZP0078638 | Fishes | | | INSDC Sequences
Phytophthora 疫黴屬 | Phytophthora attenuata | Chromists | 2013-01-01 | | INSDC Sequences
Aleiodes | Aleiodes sp. BF002167_2661 | Other insects | | | INSDC Sequences
Wikstroemia retusa A.Gray 倒卵葉蕘花 | Wikstroemia retusa | Angiosperms | 2014-03-06 | | INSDC Sequences
Meruliopsis leptocystidiata | | | | | INSDC Sequences
Spiranthes 綬草屬 | Spiranthes australis x Spiranthes sinensis | Angiosperms | 2017-04-25 | | INSDC Sequences
Erioscyphella | Erioscyphella sp. | Fungi | 2012-04-15 | | INSDC Sequences