Drosophila and human data processing

Drosophila data

Our PopFly database contains information on more than 960 wild-derived Drosophila melanogaster genome sequences from 30 populations in 18 countries across 5 continents. These genomes come from the major sequencing projects carried out in this model species to date. As part of the Drosophila Genome Nexus Project (Lack et al. 2015; Lack et al. 2016), all sequences have been re-assembled using a common pipeline to reduce potential bias due to methodological differences. Only populations with at least 4 sampled genome sequences that each have less than 20% missing or ambiguous nucleotides (after filtering for identity by descent, admixture, and heterozygosity) are included.
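The inclusion rule above can be sketched in a few lines of Python. This is only an illustration of the stated thresholds (at least 4 genomes, each with less than 20% missing or ambiguous nucleotides), not the actual PopFly pipeline; the function name `eligible_populations`, the input layout, and the example values are hypothetical.

```python
from collections import defaultdict

MIN_GENOMES = 4     # minimum number of qualifying genomes per population (from the text)
MAX_MISSING = 0.20  # genomes must have LESS than this fraction of missing/ambiguous sites

def eligible_populations(genomes, min_genomes=MIN_GENOMES, max_missing=MAX_MISSING):
    """Return the sorted population codes that pass the inclusion rule.

    `genomes` is an iterable of (population_code, missing_fraction) pairs,
    assumed to be already filtered for identity by descent, admixture,
    and heterozygosity. This input layout is an assumption for the sketch.
    """
    passing = defaultdict(int)
    for population, missing_fraction in genomes:
        if missing_fraction < max_missing:
            passing[population] += 1
    return sorted(pop for pop, count in passing.items() if count >= min_genomes)

# Made-up example: ZI has four qualifying genomes, FR only three.
example = [("ZI", 0.05)] * 4 + [("FR", 0.10)] * 3 + [("FR", 0.25)]
print(eligible_populations(example))
```

In this made-up example, FR is excluded because one of its four genomes exceeds the missing-data threshold, leaving only three qualifying sequences.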

The geographic origin of each Drosophila population, together with the number of individuals, the collector, the collection dates, and geographical parameters (latitude, longitude, and elevation), is displayed in the following dynamic map when hovering the mouse over the marker of each population.

Figure 1. Geographic origin and additional information of each of the PopFly populations. Population codes: AUS: Australia; CHB: China; CO: Cameroon; EA: Ethiopia; EB: Ethiopia; ED: Ethiopia; EF: Ethiopia; EG: Egypt; ER: Ethiopia; EZ: Ethiopia; FR: France; GA: Gabon; GU: Guinea; KN: Kenya; KR: Kenya; MW: Malawi; NTH: Netherlands; NG: Nigeria; RAL: United States; RG: Rwanda; SB: South Africa; SD: South Africa; SF: South Africa; SP: South Africa; UG: Uganda; UK: Uganda; USI: United States; USW: United States; ZI: Zambia; ZS: Zimbabwe.

In the iMKT web service, we provide data for 13,753 protein-coding genes in 16 populations from PopFly.

The service estimates the proportion of adaptive substitutions (α) as well as the fraction of sites under negative selection by implementing several MK-derived approaches that combine polymorphism data with divergence data (using Drosophila simulans as the outgroup). Recombination rate estimates are taken from Comeron et al. (2012).
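As a point of reference for α, the following minimal Python sketch computes the standard McDonald-Kreitman estimator, α = 1 − (Ds·Pn)/(Dn·Ps), from counts of nonsynonymous and synonymous polymorphic sites (Pn, Ps) and divergent sites (Dn, Ds). It illustrates only the basic formula that the MK-derived approaches build on, not the iMKT implementation; the function name `mk_alpha` and the example counts are hypothetical.

```python
def mk_alpha(pn, ps, dn, ds):
    """Standard McDonald-Kreitman estimate of the proportion of adaptive substitutions.

    pn, ps: counts of nonsynonymous and synonymous polymorphic sites
    dn, ds: counts of nonsynonymous and synonymous divergent sites
    """
    if ps == 0 or dn == 0:
        # The ratio is undefined without synonymous polymorphism
        # or nonsynonymous divergence.
        raise ValueError("alpha is undefined when ps == 0 or dn == 0")
    return 1.0 - (ds * pn) / (dn * ps)

# Made-up counts for a single gene, for illustration only.
alpha = mk_alpha(pn=10, ps=50, dn=20, ds=40)
print(f"alpha = {alpha:.2f}")
```

A negative value is possible with this estimator; it typically indicates an excess of nonsynonymous polymorphism, for example from segregating weakly deleterious variants, which is one motivation for the MK-derived corrections.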

Human data

The 1000 Genomes Project (1000GP) set out to provide a comprehensive description of common human genetic variation by applying whole-genome sequencing to a diverse set of individuals from multiple populations. In the final phase of the project (Phase 3), the consortium published the reconstruction of the genomes of 2,504 individuals from 26 populations using a combination of low-coverage whole-genome sequencing, deep exome sequencing, and dense microarray genotyping (The 1000 Genomes Project Consortium 2015). With 84.7 million single nucleotide polymorphisms (SNPs), the resource is estimated to include >99% of SNP variants with a frequency of >1% for a variety of ancestries.

PopHuman represents the most complete pipeline for population genomics analysis of the 1000GP data. The 26 analyzed 1000GP populations (The 1000 Genomes Project Consortium 2015) are listed in the following interactive map:

Figure 2. Geographic origin and additional information of each of the PopHuman populations. Population codes: CEU: Utah residents; GBR: British; FIN: Finnish; IBS: Iberian Population; TSI: Toscani; ESN: Esan; GWD: Gambian; LWK: Luhya; MSL: Mende; YRI: Yoruba; ACB: African Caribbean; ASW: Americans with African Ancestry; CDX: Chinese Dai; CHB: Han Chinese; CHS: Southern Han Chinese; JPT: Japanese; KHV: Kinh; BEB: Bengali; GIH: Gujarati; ITU: Indian Telugu; PJL: Punjabi; STU: Sri Lankan Tamil; CLM: Colombians; MXL: Americans with Mexican Ancestry; PEL: Peruvians; PUR: Puerto Ricans.

In the iMKT web service, we provide data for 20,661 protein-coding genes in the 26 populations of the 1000GP. The service estimates the proportion of adaptive substitutions (α) as well as the fraction of sites under negative selection by implementing several MK-derived approaches that combine polymorphism data with divergence data (using chimpanzee as the outgroup). Recombination rate estimates are taken from Bhérer et al. (2017).