ALSPAC OMICs Data Catalogue
Table of Contents
- 1. Introduction
- 2. Catalogue overview
- 3. Genetic Array Data
- 4. Imputed Data
- 4.1. Genome-wide - HRC imputed - G0 mothers + G1 (gi_hrc_g0m_g1)
- 4.2. Genome-wide - HapMap2 imputed - G1 (gi_hapmap2_g1)
- 4.3. Genome-wide - HapMap2 imputed - G0 mothers (gi_hapmap2_g0m)
- 4.4. Genome-wide - 1000G imputed - G0 partners (gi_1000g_g0p)
- 4.5. Genome-wide - 1000G imputed - G0 mothers + G1 (gi_1000g_g0m_g1)
- 5. Sequence Data
- 6. Epigenetic Data
- 7. Gene Expression Data
- 8. Omics tips
- 8.1. Introduction
- 8.2. Disclaimer
- 8.3. Operating systems
- 8.4. Key Omics software
- 8.5. File types
- 8.6. Variant/SNP ids
- 8.7. Overview of Imputation reference panels
- 8.8. SNP data types from imputation.
- 8.9. SNP Statistics
- 8.10. Best practice
- 8.11. Population stratification
- 8.12. Common tasks
- 8.13. Courses
- 8.14. Further sources of help
1. Introduction
Welcome to the ALSPAC Omics Catalogue, a guide to the omics data offered by ALSPAC. This catalogue features a variety of named ALSPAC datasets, each consisting of collected or produced data that has been organized, named, and curated for ease of use. Every named ALSPAC dataset comes with accompanying metadata that provides information about the dataset as a whole. Each named ALSPAC dataset has at least one release version that includes a curated selection of files detailed in the metadata sections.
Please note that these datasets are not generally accessible. See http://www.bristol.ac.uk/alspac/researchers/access/ for details of how to request access.
The information within this catalogue is made available for browsing to help both internal ALSPAC users and external researchers understand the data and facilitate prospective data requests.
For external ALSPAC collaborators, we offer "freezes" of specific versions of named ALSPAC datasets as standard. These freezes, along with their metadata, are outlined in this catalogue. External collaborators will be granted access to these freezes upon request (see http://www.bristol.ac.uk/alspac/researchers/access/ ). A freeze is a carefully selected subset of the data files within a version: it contains the core data from a dataset, with individuals who have withdrawn consent removed and specific dataset IDs applied. These freezes are subject to periodic updates.
Due to the removal of withdrawn individuals from the freezes, please note that the number of participants within each dataset may change over time and may not match those found in the Methodology fields.
Freeze 1 timings: July 2021 - Dec 2022
Freeze 2 timings: Dec 2022 - Dec 2023
Freeze 3 timings: Jan 2023 - Current
Documentation for the current freeze, in the form of a yaml file, is presented below. It lists the files external collaborators will receive, accompanied by metadata.
The metadata presented in our catalogue adheres to the ALSPAC Data catalogue Schema, which is crafted in LinkML. To explore the full schema documentation, please visit: https://alspac.github.io/alspac-data-catalogue-schema/
This website is equipped with RDFa, enabling the metadata to be machine-readable and allowing for the creation of queries using SPARQL with compatible tools, such as Apache Any23 and Apache Jena.
For more information about this, see the document on FAIR data principles and the document describing the rationale and construction of this catalogue here.
2. Catalogue overview
alspacdcs:alspac data catalogue 001 a dcat:Catalog | |
---|---|
schema:description | This catalogue is for all of the named alspac omics data sets. |
schema:email | alspac-omics@bristol.ac.uk |
schema:name | ALSPAC Omics Data Catalogue |
alspacdcs:named alspac datasets | |
alspacdcs:primary investigator orcids | |
alspacdcs:see also | |
3. Genetic Array Data
3.1. Genome-wide - Illumina 550 quad - G1 (gwa_550_g1)
3.1.1. Description
This dataset contains genome-wide array genotype calls for G1 individuals.
3.1.2. Methodology
ALSPAC children were genotyped using the Illumina HumanHap550 quad chip genotyping platform by 23andMe, subcontracting the Wellcome Trust Sanger Institute, Cambridge, UK and the Laboratory Corporation of America, Burlington, NC, US. The resulting raw genome-wide data were subjected to standard quality control methods. Individuals were excluded on the basis of gender mismatches; minimal or excessive heterozygosity; disproportionate levels of individual missingness (>3%) and insufficient sample replication (IBD < 0.8). Population stratification was assessed by multidimensional scaling analysis and compared with HapMap II (release 22) European descent (CEU), Han Chinese, Japanese and Yoruba reference populations; all individuals with non-European ancestry were removed. SNPs with a minor allele frequency of < 1%, a call rate of < 95% or evidence for violations of Hardy-Weinberg equilibrium (P < 5E-7) were removed. Cryptic relatedness was measured as proportion of identity by descent (IBD > 0.1). Related subjects that passed all other quality control thresholds were retained during subsequent phasing and imputation. 9,115 subjects and 500,527 SNPs passed these quality control filters.
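The SNP-level thresholds quoted above can be written as a simple predicate. The following is a minimal sketch, not the pipeline that was actually run: the `SnpStats` record, its field names and the example SNP identifiers are hypothetical, and in practice the per-SNP summary statistics would come from a tool such as PLINK.

```python
from dataclasses import dataclass

# Thresholds quoted in the methodology text.
MIN_MAF = 0.01        # SNPs with minor allele frequency < 1% are removed
MIN_CALL_RATE = 0.95  # SNPs with call rate < 95% are removed
HWE_P_CUTOFF = 5e-7   # SNPs with Hardy-Weinberg P < 5E-7 are removed


@dataclass
class SnpStats:
    """Hypothetical per-SNP summary statistics (names are illustrative)."""
    snp_id: str
    maf: float
    call_rate: float
    hwe_p: float


def passes_snp_qc(snp: SnpStats) -> bool:
    """Return True if the SNP survives all three SNP-level filters."""
    return (
        snp.maf >= MIN_MAF
        and snp.call_rate >= MIN_CALL_RATE
        and snp.hwe_p >= HWE_P_CUTOFF
    )


# Invented example values, one SNP per failure mode.
snps = [
    SnpStats("rs_example_1", maf=0.20, call_rate=0.99, hwe_p=0.40),
    SnpStats("rs_example_2", maf=0.005, call_rate=0.99, hwe_p=0.40),  # too rare
    SnpStats("rs_example_3", maf=0.20, call_rate=0.90, hwe_p=0.40),   # low call rate
    SnpStats("rs_example_4", maf=0.20, call_rate=0.99, hwe_p=1e-9),   # HWE violation
]
kept = [s.snp_id for s in snps if passes_snp_qc(s)]
print(kept)
```

Only the first example SNP is retained. Individual-level exclusions (sex mismatch, heterozygosity, missingness, ancestry) are not covered by this sketch.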
3.1.3. Freeze Docs
# This yaml file is a description of a freeze of a released version of a named alspac dataset
# It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema
id: alspacdcs:gwa_550_g1_2022-12-05_f3
name: >-
  Genome-wide array data including raw files and genotype calls for G1 individuals 2022-12-05 freeze 3
description: >-
  The third freeze of the genome-wide array data for G1 based on a 2022-12-05 release.
  The data is in plink format.
freeze_size: 997M
linker_file_md5sum: b528acad88cd1697129a7cd59aa14ada
woc_file_md5sum: cf9249c306e766a8689f78197e1f5f25
all_individuals_to_exclude_md5sum: 7faad74aeebaba4ed71aac783414d75b
git_tag: https://github.com/alspac/dataset_gwa_550_g1/releases/tag/freeze3
is_current_freeze: true
freeze_number: 3
freeze_date: 2023-09-13
previous_freeze: alspacdcs:gwa_550_g1_2022-12-05_f2
freeze_of_alspac_dataset_version: alspacdcs:gwa_550_g1_2022-12-05
freeze_of_named_alspac_dataset: alspacdcs:gwa_550_g1
has_containers:
  - id: alspacdcs:5b87a9bf-879b-4d26-b3e2-aab9b14a1fdb ## uuid
    name: data
    description: A dir/folder containing the two freeze data files
has_parts:
  - id: alspacdcs:2fde6fb6-a1a9-454b-b0bc-51d450a80447
    name: Biallelic genotype table
    description: >-
      genotype data
    data_distributions:
      - id: alspacdcs:0dcc9a3c-7d5c-446e-a9a0-3493db443d0e
        name: freeze_id.bed
        description: >-
          Plink bed file. Primary representation of genotype calls at biallelic variants.
          Must be accompanied by .bim and .fam files.
        md5sum: 8ce44ce1dbf5c4d7f3299681cbf3dacf
        filesize: 982M
        filetype: .bed
        number_of_participants: 8224
        number_of_variants: 500527
        belongs_to_container: alspacdcs:5b87a9bf-879b-4d26-b3e2-aab9b14a1fdb
  - id: alspacdcs:af4a19ce-a0c0-4086-80da-da4a6865dae0
    name: Variant Information
    description: >-
      Information about SNPS
    data_distributions:
      - id: alspacdcs:f00b1310-f7f6-47c7-b46d-7082f43f542d
        name: freeze_id.bim
        description: >-
          Extended variant information file accompanying a .bed binary genotype table.
          (--make-just-bim can be used to update just this file.)
          A text file with no header line, and one line per variant with the following six fields:
          1. Chromosome code (either an integer, or 'X'/'Y'/'XY'/'MT'; '0' indicates unknown) or name
          2. Variant identifier
          3. Position in morgans or centimorgans (safe to use dummy value of '0')
          4. Base-pair coordinate (1-based; limited to 2^31-2)
          5. Allele 1 (corresponding to clear bits in .bed; usually minor)
          6. Allele 2 (corresponding to set bits in .bed; usually major)
        md5sum: c7fa007331fab0e8b6ce5b78412848da
        filesize: 14M
        filetype: .bim
        number_of_variants: 500527
        belongs_to_container: alspacdcs:5b87a9bf-879b-4d26-b3e2-aab9b14a1fdb
  - id: alspacdcs:c3ac5077-d8d4-44d5-9456-3b731d23f67f
    name: sample info
    description: >-
      Sample ids
    data_distributions:
      - id: alspacdcs:c20cd22a-61ac-49a6-8adb-2b877868784f
        name: freeze_id.fam
        description: >-
          A text file with no header line, and one line per sample with the following six fields:
          1. Family ID ('FID')
          2. Within-family ID ('IID'; cannot be '0')
          3. Within-family ID of father ('0' if father isn't in dataset)
          4. Within-family ID of mother ('0' if mother isn't in dataset)
          5. Sex code ('1' = male, '2' = female, '0' = unknown)
          6. Phenotype value ('1' = control, '2' = case, '-9'/'0'/non-numeric = missing data if case/control)
        md5sum: befc659c2383222218e4e002665bdfb7
        filesize: 249k
        filetype: .fam
        number_of_participants: 8224
        belongs_to_container: alspacdcs:5b87a9bf-879b-4d26-b3e2-aab9b14a1fdb
  - id: alspacdcs:8d57fbeb-51de-48f9-a92f-92d70f936a5a
    name: Heterozygous haploid and nonmale Y chromosome call list
    description: >-
      A plink report
    data_distributions:
      - id: alspacdcs:dfc98933-69fa-4e53-99ab-55e50836ccbf
        name: freeze_id.hh
        description: >-
          Produced automatically when the input data contains heterozygous calls where they
          shouldn't be possible (haploid chromosomes, male X/Y), or there are nonmissing calls
          for nonmales on the Y chromosome.
          A text file with one line per error (sorted primarily by variant ID, secondarily by
          sample ID) with the following three fields:
          Family ID
          Within-family ID
          Variant ID
        md5sum: 2b5b4d40d9f4a18755a94efd7c9709e3
        filesize: 1.7M
        filetype: .hh
        belongs_to_container: alspacdcs:5b87a9bf-879b-4d26-b3e2-aab9b14a1fdb
  - id: alspacdcs:e32ff428-5f2d-4c04-9c29-940c2812a867
    name: Logs
    description: >-
      plink log
    data_distributions:
      - id: alspacdcs:09912f6a-dd58-4723-8f0b-ae6825d30dc4
        name: freeze_id.log
        description: >-
          plink log file
        md5sum: a54bea59148ed6b1aabfd590cea050a6
        filesize: 1.2K
        filetype: .log
        belongs_to_container: alspacdcs:5b87a9bf-879b-4d26-b3e2-aab9b14a1fdb
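Each data distribution in a freeze document carries an `md5sum`, so a received file can be checked against it. The sketch below shows one way to do that with the Python standard library; the function names are illustrative, and the demonstration uses a temporary file rather than a real freeze file.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5(path: Path, expected_md5: str) -> bool:
    """Return True if the file's MD5 matches the value from the freeze yaml."""
    return md5_of_file(path) == expected_md5.strip().lower()


# Demonstration on a throwaway file; a real check would pass the path of a
# received file and the md5sum field listed for it in the freeze document.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "demo.bin"
    demo_path.write_bytes(b"example bytes")
    expected = hashlib.md5(b"example bytes").hexdigest()
    print(verify_md5(demo_path, expected))
```

Reading in chunks keeps memory use small, which matters for the larger files in a freeze.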
3.2. Genome-wide - Illumina exome core array - G0 partners (gwa_exome_g0p)
3.2.1. Description
This dataset contains genome-wide array genotype calls for G0 mothers and partners.
3.2.2. Methodology
3,453 ALSPAC mothers and fathers were genotyped at 535,478 SNPs using the Illumina HumanCoreExome chip genotyping platform by the ALSPAC lab, and genotypes were called using GenomeStudio. The resulting raw genome-wide data were subjected to standard quality control methods using PLINK (v1.07). Individuals were excluded on the basis of gender mismatches (n = 80); minimal or excessive heterozygosity (n = 64); disproportionate levels of individual missingness (>5%, n = 60) and possible contamination (n = 3). Population stratification was assessed by multidimensional scaling analysis and compared with 1000 Genomes phase 3 data and principal component analysis (n = 266); all individuals with non-European ancestry were removed. Cryptic relatedness was measured as SNP relatedness in GCTA (relatedness > 0.1, n = 69 removed). SNPs with a call rate of < 95% or evidence for violations of Hardy-Weinberg equilibrium (P < 1E-7) and those which failed GenomeStudio quality control measures were removed (n = 21,298). 6,594 duplicate SNPs were also removed. This resulted in 2,911 unrelated mother and father genotypes at 507,586 SNPs. We then identified 2,217 samples where the aln assigned historically by the lab matched the genetically assigned aln.
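The exclusion counts above can be checked against the reported totals. This short sketch assumes the exclusion categories are applied one after another without overlap, which the text does not state explicitly but which is consistent with the numbers adding up; the dictionary labels are paraphrases of the text.

```python
# Individual-level exclusions, as reported in the methodology.
genotyped_individuals = 3_453
individual_exclusions = {
    "gender mismatch": 80,
    "minimal or excessive heterozygosity": 64,
    "individual missingness > 5%": 60,
    "possible contamination": 3,
    "non-European ancestry": 266,
    "cryptic relatedness > 0.1": 69,
}
remaining_individuals = genotyped_individuals - sum(individual_exclusions.values())

# SNP-level exclusions, as reported in the methodology.
genotyped_snps = 535_478
snp_exclusions = {
    "call rate, HWE or GenomeStudio QC failure": 21_298,
    "duplicate SNPs": 6_594,
}
remaining_snps = genotyped_snps - sum(snp_exclusions.values())

print(remaining_individuals, remaining_snps)  # 2911 507586
```

Both results match the 2,911 individuals and 507,586 SNPs quoted in the text.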
3.2.3. Freeze Docs
# This yaml file is a description of a freeze of a released version of a named alspac dataset
# It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema
id: alspacdcs:gwa_exome_g0p_2016-11-22_f3
name: Freeze 3 version 2016-11-22 Genome-wide - Illumina exome core array - G0 partners
description: >-
  Freeze 3 version 2016-11-22 Genome-wide array data including raw files and genotype calls
  for G0 partners, also including additional G0 mothers who were absent from previous
  genotyping rounds
freeze_size: 281M
linker_file_md5sum: b528acad88cd1697129a7cd59aa14ada
woc_file_md5sum: cf9249c306e766a8689f78197e1f5f25
all_individuals_to_exclude_md5sum: 7faad74aeebaba4ed71aac783414d75b
git_tag: https://github.com/alspac/dataset_gwa_exome_g0p/releases/tag/freeze3
is_current_freeze: true
freeze_number: 3
freeze_date: 2023-09-1913
previous_freeze: alspacdcs:gwa_exome_g0p_2016-11-22_f2
freeze_of_alspac_dataset_version: alspacdcs:gwa_exome_g0p_2016-11-22
freeze_of_named_alspac_dataset: alspacdcs:gwa_exome_g0p
has_containers:
  - id: alspacdcs:6c843f1c-5225-4780-8a93-58315f5e9dfe
    name: data
    description: A dir/folder containing the plink data files
has_parts:
  - id: alspacdcs:c5d1dff1-3f6d-4506-aebd-c56db36e8d85
    name: freeze_id
    data_distributions:
      - id: alspacdcs:353f07d8-4345-4904-bca2-f6549612f38d
        name: freeze_id.fam
        description: >-
          A text file with no header line, and one line per sample with the following six fields:
          1. Family ID ('FID')
          2. Within-family ID ('IID'; cannot be '0')
          3. Within-family ID of father ('0' if father isn't in dataset)
          4. Within-family ID of mother ('0' if mother isn't in dataset)
          5. Sex code ('1' = male, '2' = female, '0' = unknown)
          6. Phenotype value ('1' = control, '2' = case, '-9'/'0'/non-numeric = missing data if case/control)
          Here we use both of the first two fields to hold the full id of the participant,
          i.e. not separate family and within-family ids.
        md5sum: 7c8d1559304240941ef9a047d84299f4
        filesize: 123KB
        filetype: .fam
        number_of_participants: 2198
        belongs_to_container: alspacdcs:6c843f1c-5225-4780-8a93-58315f5e9dfe
  - id: alspacdcs:c5f6f2d2-3f61-4bfa-a43c-c6e349a76607
    name: freeze_id
    data_distributions:
      - id: alspacdcs:1b737f6f-f7b5-425c-ae16-45ce9ae8796c
        name: freeze_id.bim
        description: >-
          Extended variant information file accompanying a .bed binary genotype table.
          (In plink, --make-just-bim can be used to update just this file.)
          A text file with no header line, and one line per variant with the following six fields:
          1. Chromosome code (either an integer, or 'X'/'Y'/'XY'/'MT'; '0' indicates unknown) or name
          2. Variant identifier
          3. Position in morgans or centimorgans (safe to use dummy value of '0')
          4. Base-pair coordinate (1-based; limited to 2^31-2)
          5. Allele 1 (corresponding to clear bits in .bed; usually minor)
          6. Allele 2 (corresponding to set bits in .bed; usually major)
        md5sum: 0fe43f888776059fef0a76d3f08d00ad
        filesize: 14MB
        filetype: .bim
        number_of_variants: 507586
        belongs_to_container: alspacdcs:6c843f1c-5225-4780-8a93-58315f5e9dfe
  - id: alspacdcs:cceed8c5-0276-4dbe-a617-7b585078caa0
    name: freeze_id
    data_distributions:
      - id: alspacdcs:40da3e0d-a450-46ee-954e-c1b70751f3d0
        name: freeze_id.bed
        description: >-
          Primary representation of genotype calls at biallelic variants.
          Must be accompanied by .bim and .fam files.
        md5sum: 304b0d356880c5174806ce08d7beffd3
        filesize: 267MB
        filetype: .bed
        number_of_participants: 2198
        number_of_variants: 507586
        belongs_to_container: alspacdcs:6c843f1c-5225-4780-8a93-58315f5e9dfe
  - id: alspacdcs:19556134-925a-4753-8834-933c4c74e784
    name: freeze_id
    data_distributions:
      - id: alspacdcs:2d9618aa-32bb-4b32-b62a-1b43f785584d
        name: freeze_id.log
        md5sum: 9df3e3d178b71bbc1370e89a329ae543
        filesize: 1.2KB
        filetype: .log
        belongs_to_container: alspacdcs:6c843f1c-5225-4780-8a93-58315f5e9dfe
  - id: alspacdcs:b0d09887-a085-4d7c-a0d2-c9986a72c3db
    name: freeze_id
    data_distributions:
      - id: alspacdcs:e69564ad-9e75-4eea-8779-5ecaf04cff23
        name: freeze_id.hh
        description: >-
          plink .hh file see https://www.cog-genomics.org/plink/1.9/formats#hh
        md5sum: 96660f1fa14a45bda605acdfb92f2d3e
        filesize: 116K
        filetype: .hh
        belongs_to_container: alspacdcs:6c843f1c-5225-4780-8a93-58315f5e9dfe
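The six-field .fam layout described in the freeze document can be read with a few lines of code. The sketch below is illustrative only: the `FamRecord` type is hypothetical, and the example line uses an invented identifier that does not follow any real ALSPAC ID format. It repeats the same value in the first two fields, on the assumption that this is what the note about both fields holding the full participant id means.

```python
from typing import NamedTuple


class FamRecord(NamedTuple):
    """One line of a PLINK .fam file (six whitespace-separated fields)."""
    fid: str        # Family ID
    iid: str        # Within-family ID
    father: str     # '0' if father isn't in dataset
    mother: str     # '0' if mother isn't in dataset
    sex: str        # '1' = male, '2' = female, '0' = unknown
    phenotype: str  # '1' = control, '2' = case, '-9'/'0' = missing


SEX_LABELS = {"1": "male", "2": "female", "0": "unknown"}


def parse_fam_line(line: str) -> FamRecord:
    """Split a .fam line into its six fields, rejecting malformed lines."""
    fields = line.split()
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, found {len(fields)}: {line!r}")
    return FamRecord(*fields)


# Invented example line; FID and IID both carry the full participant id here.
rec = parse_fam_line("EX0001 EX0001 0 0 2 -9")
print(rec.iid, SEX_LABELS.get(rec.sex, "unknown"))
```

A .bim file could be handled the same way, with its own six-field record type.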
3.3. Genome-wide - Illumina 660 quad - G0 mothers (gwa_660_g0m)
3.3.1. Description
This dataset contains genome-wide array data including raw files and genotype calls for G0 mothers.
3.3.2. Methodology
ALSPAC mothers were genotyped using the Illumina human660W-quad array at Centre National de Génotypage (CNG) and genotypes were called with Illumina GenomeStudio. PLINK (v1.07) was used to carry out quality control measures on an initial set of 10,015 subjects and 557,124 directly genotyped SNPs. SNPs were removed if they displayed more than 5% missingness or a Hardy-Weinberg equilibrium P value of less than 1.0e-06. Additionally, SNPs with a minor allele frequency of less than 1% were removed. Samples were excluded if they displayed more than 5% missingness, had indeterminate X chromosome heterozygosity or extreme autosomal heterozygosity. Samples showing evidence of population stratification were identified by multidimensional scaling of genome-wide identity by state pairwise distances using the four HapMap populations as a reference, and then excluded. Cryptic relatedness was assessed using an IBD estimate of more than 0.125, which is expected to correspond to roughly 12.5% of alleles shared IBD or relatedness at the first-cousin level. Related subjects that passed all other quality control thresholds were retained. In total, 9,048 subjects and 526,688 SNPs passed these quality control filters.
3.3.3. Freeze Docs
# This yaml file is a description of a freeze of a released version of a named alspac dataset
# It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema
id: alspacdcs:gwa_660_g0m_2022-12-05_f3
name: Freeze 3 version 2022-12-05 Genome-wide - Illumina 660 quad - G0 mothers
description: >-
  Freeze 3 of genome-wide array data including genotype calls for G0 mothers
freeze_size: 2G
linker_file_md5sum: b528acad88cd1697129a7cd59aa14ada
woc_file_md5sum: cf9249c306e766a8689f78197e1f5f25
all_individuals_to_exclude_md5sum: 7faad74aeebaba4ed71aac783414d75b
git_tag: https://github.com/alspac/dataset_gwa_660_g0m/releases/tag/freeze3
is_current_freeze: true
freeze_number: 3
freeze_date: 2023-09-13
freeze_of_alspac_dataset_version: alspacdcs:gwa_660_g0m_2022-12-05
freeze_of_named_alspac_dataset: alspacdcs:gwa_660_g0m
has_containers:
  - id: alspacdcs:b610f7ab-8af9-4bd4-8edc-4d90cd0d2763
    name: data
    description: A dir/folder containing the plink data files
  - id: alspacdcs:ab3e1d38-8d4f-46c9-b860-cbccddecd012
    name: legacy1
    description: A dir/folder containing the plink data files. Includes full set of SNPs but is missing ~500 mothers who were excluded in legacy QC due to strict relatedness inclusion thresholds.
    belongs_to_container: alspacdcs:b610f7ab-8af9-4bd4-8edc-4d90cd0d2763
  - id: alspacdcs:9f6244b7-b1c6-4164-9180-c996255d8de1
    name: legacy2
    description: A dir/folder containing the plink data files. Includes full set of individuals but due to legacy QC is restricted to a set of ~480k SNPs that overlap with the Illumina 550k array (which was used for G1).
    belongs_to_container: alspacdcs:b610f7ab-8af9-4bd4-8edc-4d90cd0d2763
has_parts:
  - id: alspacdcs:39b88df9-de1c-4abd-abd2-68751b6a8e26
    name: Biallelic genotype table
    description: >-
      The genetic data
    data_distributions:
      - id: alspacdcs:e0cfc624-5e48-43d0-b31a-160ed23e9768
        name: freeze_id.bed
        description: >-
          Legacy 1 plink bed file. Primary representation of genotype calls at biallelic variants.
          Must be accompanied by .bim and .fam files.
          The legacy1 distribution of the plink bed file.
        md5sum: bb6389e3421f8c94994e85cf7390ae79
        filesize: 1021M
        filetype: .bed
        number_of_participants: 8123
        number_of_variants: 526688
        belongs_to_container: alspacdcs:ab3e1d38-8d4f-46c9-b860-cbccddecd012
      - id: alspacdcs:822e8560-f2f1-49fa-8b03-af745fe130ba
        name: freeze_id.bed
        description: >-
          Legacy 2 plink bed file. Primary representation of genotype calls at biallelic variants.
          Must be accompanied by .bim and .fam files.
          The legacy2 distribution of the plink bed file.
        md5sum: 870190e42e10c8c902f21e4f2f1cb96e
        filesize: 962M
        filetype: .bed
        number_of_variants: 465740
        number_of_participants: 8653
        belongs_to_container: alspacdcs:9f6244b7-b1c6-4164-9180-c996255d8de1
  - id: alspacdcs:58b8c0ca-ae2c-4cf6-b4f7-7cbf17a3b10f
    name: Variant Information
    description: >-
      Information about genetic variants
    data_distributions:
      - id: alspacdcs:60b656f8-fb99-4ccd-9701-d7d896b4658d
        name: freeze_id.bim
        description: >-
          Legacy 1 Extended variant information file accompanying a .bed binary genotype table.
          (--make-just-bim can be used to update just this file.)
          A text file with no header line, and one line per variant with the following six fields:
          1. Chromosome code (either an integer, or 'X'/'Y'/'XY'/'MT'; '0' indicates unknown) or name
          2. Variant identifier
          3. Position in morgans or centimorgans (safe to use dummy value of '0')
          4. Base-pair coordinate (1-based; limited to 2^31-2)
          5. Allele 1 (corresponding to clear bits in .bed; usually minor)
          6. Allele 2 (corresponding to set bits in .bed; usually major)
        md5sum: db817272cbc16d31083e1c788f03996c
        filesize: 14M
        filetype: .bim
        number_of_variants: 526688
        belongs_to_container: alspacdcs:ab3e1d38-8d4f-46c9-b860-cbccddecd012
      - id: alspacdcs:de57fccd-d6a5-4355-8522-c283f9ca589c
        name: freeze_id.bim
        description: >-
          Legacy 2 Extended variant information file accompanying a .bed binary genotype table.
          (--make-just-bim can be used to update just this file.)
          A text file with no header line, and one line per variant with the following six fields:
          1. Chromosome code (either an integer, or 'X'/'Y'/'XY'/'MT'; '0' indicates unknown) or name
          2. Variant identifier
          3. Position in morgans or centimorgans (safe to use dummy value of '0')
          4. Base-pair coordinate (1-based; limited to 2^31-2)
          5. Allele 1 (corresponding to clear bits in .bed; usually minor)
          6. Allele 2 (corresponding to set bits in .bed; usually major)
        md5sum: 8eb8e81af2af1e06d818ca391488f210
        filesize: 13M
        filetype: .bim
        number_of_variants: 465740
        belongs_to_container: alspacdcs:9f6244b7-b1c6-4164-9180-c996255d8de1
  - id: alspacdcs:f456e79f-2255-42d8-a121-82d80293a034
    name: Sample information
    description: >-
      Information about the samples for the dataset
    data_distributions:
      - id: alspacdcs:c59e34c3-62e6-4750-8beb-665876b255ff
        name: freeze_id.fam
        description: >-
          legacy 1
          A text file with no header line, and one line per sample with the following six fields:
          1. Family ID ('FID')
          2. Within-family ID ('IID'; cannot be '0')
          3. Within-family ID of father ('0' if father isn't in dataset)
          4. Within-family ID of mother ('0' if mother isn't in dataset)
          5. Sex code ('1' = male, '2' = female, '0' = unknown)
          6. Phenotype value ('1' = control, '2' = case, '-9'/'0'/non-numeric = missing data if case/control)
        md5sum: e8d10db354416efa0a3cfe60dcf7d7df
        filesize: 254K
        filetype: .fam
        number_of_participants: 8123
        belongs_to_container: alspacdcs:ab3e1d38-8d4f-46c9-b860-cbccddecd012
      - id: alspacdcs:b454c198-d15b-4420-ab75-270c0377c6eb
        name: freeze_id.fam
        description: >-
          legacy2
          A text file with no header line, and one line per sample with the following six fields:
          1. Family ID ('FID')
          2. Within-family ID ('IID'; cannot be '0')
          3. Within-family ID of father ('0' if father isn't in dataset)
          4. Within-family ID of mother ('0' if mother isn't in dataset)
          5. Sex code ('1' = male, '2' = female, '0' = unknown)
          6. Phenotype value ('1' = control, '2' = case, '-9'/'0'/non-numeric = missing data if case/control)
        md5sum: b21a5ceb9b0aa2193614ee6f45da0bfa
        filesize: 448k
        filetype: .fam
        number_of_participants: 8653
        belongs_to_container: alspacdcs:9f6244b7-b1c6-4164-9180-c996255d8de1
  - id: alspacdcs:971d0382-a395-465c-ba91-e73d5957c768
    name: Log information
    description: >-
      Information about the plink run for making the dataset
    data_distributions:
      - id: alspacdcs:d6df50aa-c36f-4af9-8224-40f0cbd44e21
        name: freeze_id.log
        description: >-
          legacy 1 plink log file
        md5sum: de43ed5543e1cf7ec2abc351dd702190
        filesize: 1.1k
        filetype: .log
        belongs_to_container: alspacdcs:ab3e1d38-8d4f-46c9-b860-cbccddecd012
  - id: alspacdcs:2e99b905-8f7a-4679-91af-19eaa043b345
    name: Log information
    description: >-
      Information about the plink run for making the dataset
    data_distributions:
      - id: alspacdcs:fe7667c9-50ce-4385-a3f4-62a740a65336
        name: freeze_id.log
        description: >-
          legacy2 plink log file
        md5sum: 4a9cfccbef5ee2ce0b2ac502a8f83790
        filesize: 1.1k
        filetype: .log
        belongs_to_container: alspacdcs:9f6244b7-b1c6-4164-9180-c996255d8de1
3.4. Genome-wide - CNV - G1 (cnv_550_g1)
3.4.1. Description
This dataset contains predicted ALSPAC CNVs using PennCNV, generated from 23andMe raw genotype data.
3.4.2. Methodology
LRR and BAF data were missing from the 23andMe raw genotype data, so we had to generate these data ourselves using an in-house algorithm. Once the data were generated, we ran PennCNV using the hh550 libraries.
These are filtered PennCNV calls. Multiple calls were merged using the 'clean_cnv.pl' script, using a merge fraction of 0.5. Individuals with > 30 CNVs, a Log R Ratio SD of > 0.3, a BAF drift of > 0.002, and a waviness factor of > 0.05 were removed. CNVs in which at least 50% of the length of the CNV call overlapped with any telomeric, centromeric or immunoglobulin region were removed using the 'scan_region.pl' script in PennCNV.
In addition, CNVs covering fewer than 5 probes, of a length < 5kb, or with a confidence score below 10 were removed. Density was calculated as the number of probes in a CNV divided by the length of the CNV, and CNVs where the density of probes across the call was < 1 probe per 20kb were removed.
These QC parameters are suggestions only and the resulting calls are provided in filtered.cnv. Analysts can apply their own filter parameters to the raw calls in data.cnv.
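For analysts who want to apply their own thresholds to the raw calls, the call-level filters described above can be sketched as follows. This is an illustration rather than the script that produced the filtered file: the `CnvCall` record and its field names are hypothetical, and the individual-level and region-overlap filters are not covered.

```python
from dataclasses import dataclass

# Call-level thresholds quoted in the methodology text.
MIN_PROBES = 5              # CNVs covering fewer than 5 probes are removed
MIN_LENGTH_BP = 5_000       # CNVs shorter than 5kb are removed
MIN_CONFIDENCE = 10         # CNVs with confidence score below 10 are removed
BP_PER_PROBE_MAX = 20_000   # density must be at least 1 probe per 20kb


@dataclass
class CnvCall:
    """Hypothetical representation of one CNV call."""
    n_probes: int
    length_bp: int
    confidence: float


def passes_cnv_qc(call: CnvCall) -> bool:
    """Apply the probe-count, length, confidence and density filters."""
    # Density >= 1 probe per 20kb, written with integers to avoid
    # floating-point comparison: n_probes / length >= 1 / 20000.
    dense_enough = call.n_probes * BP_PER_PROBE_MAX >= call.length_bp
    return (
        call.n_probes >= MIN_PROBES
        and call.length_bp >= MIN_LENGTH_BP
        and call.confidence >= MIN_CONFIDENCE
        and dense_enough
    )


# Invented example calls, one per failure mode.
calls = [
    CnvCall(n_probes=10, length_bp=50_000, confidence=25.0),   # passes
    CnvCall(n_probes=3, length_bp=50_000, confidence=25.0),    # too few probes
    CnvCall(n_probes=10, length_bp=4_000, confidence=25.0),    # too short
    CnvCall(n_probes=10, length_bp=50_000, confidence=8.0),    # low confidence
    CnvCall(n_probes=6, length_bp=200_000, confidence=25.0),   # too sparse
]
print([passes_cnv_qc(c) for c in calls])
```

Changing the constants at the top is enough to explore stricter or more lenient filters.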
3.4.3. Freeze Docs
# This yaml file is a description of a freeze of a released version of a named alspac dataset
# It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema
id: alspacdcs:cnv_550_g1_2015-11-09_f3
name: Genome-wide - CNV - G1 release version 2015-11-09 freeze 3
description: >-
  This is the third freeze of the 2015-11-09 version of cnv_550_g1 dataset.
  It contains two csv versions of the called CNV data, the unfiltered and filtered versions.
freeze_size: 27m
linker_file_md5sum: b528acad88cd1697129a7cd59aa14ada
woc_file_md5sum: cf9249c306e766a8689f78197e1f5f25
all_individuals_to_exclude_md5sum: 7faad74aeebaba4ed71aac783414d75b
git_tag: https://github.com/alspac/dataset_cnv_550_g1/releases/tag/freeze3
is_current_freeze: true
freeze_number: 3
freeze_date: 2023-09-13
previous_freeze: alspacdcs:cnv_550_g1_2015-11-09_f2
freeze_of_alspac_dataset_version: alspacdcs:cnv_550_g1_2015-11-09
freeze_of_named_alspac_dataset: alspacdcs:cnv_550_g1
has_parts:
  - id: alspacdcs:cnv_550_g1_2015-11-09_cnvdata_f3
    name: Unfiltered CNV data
    description: >-
      This is the output of Penncnv before filtering.
      columns
      V1 - Position
      V2 - Number of markers in the region
      V3 - CNV length
      V4 - Copy number estimate
      V6 - Start SNP
      V7 - End SNP
      V8 - Confidence score
      qlet - within pregnancy ID
      cnv_550_g1 - Individual ID
    data_distributions:
      - id: alspacdcs:cdd6bfc2e28db5a76806aa24c73df110_new_cnvdata.csv
        name: new_cnvdata.csv
        description: >-
          This is the csv file for the output of Penncnv before filtering.
        md5sum: cdd6bfc2e28db5a76806aa24c73df110
        filesize: 21M
        filetype: .csv
        number_of_participants: 7450
        # data$id_qlet <- paste(data$cnv_550_g1, data$qlet, sep="_")
        # length(unique(data$id_qlet))
        number_of_cnv_variants: 70030
        # Read file into R as data then:
        # dim(unique(data[1]))
        belongs_to_container: alspacdcs:bd0fb41e-f720-46a7-9ed0-04dd3e0b22bd
  - id: alspacdcs:cnv_550_g1_2015-11-09_filtered_f3
    name: Filtered CNV data
    description: >-
      CNV data that has been filtered.
      columns
      V1 - Position
      V2 - Number of markers in the region
      V3 - CNV length
      V4 - Copy number estimate
      V6 - Start SNP
      V7 - End SNP
      V8 - Confidence score
      qlet - within pregnancy ID
      cnv_550_g1 - Individual ID
    data_distributions:
      - id: alspacdcs:98eb9cb3bfd21eb807800de82f1e8099_new_filtered.csv
        name: new_filtered.csv
        description: >-
          This is the csv file for the output of Penncnv after filtering.
        md5sum: 98eb9cb3bfd21eb807800de82f1e8099
        filesize: 5.9M
        filetype: .csv
        number_of_participants: 6793
        # Read into data2 in R
        # data2$id_qlet <- paste(data2$cnv_550_g1, data2$qlet, sep="_") and length(unique(data2$id_qlet))
        number_of_cnv_variants: 14244
        # Read into data2 in R then
        # length(unique(data2$V1))
        belongs_to_container: alspacdcs:bd0fb41e-f720-46a7-9ed0-04dd3e0b22bd
has_containers:
  - id: alspacdcs:bd0fb41e-f720-46a7-9ed0-04dd3e0b22bd ## uuid
    name: data
    description: A dir/folder containing the two freeze data files
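The R snippets in the yaml comments count participants as unique combinations of the individual ID and `qlet`, and CNVs as unique values of `V1`. A Python equivalent might look like the sketch below. It assumes the csv file has a header row containing the column names `V1`, `qlet` and `cnv_550_g1`, which is not confirmed here, and the demonstration rows are invented.

```python
import csv
import io
from typing import TextIO, Tuple


def count_participants_and_cnvs(handle: TextIO) -> Tuple[int, int]:
    """Count unique participants (ID + qlet) and unique CNV positions (V1)."""
    participants = set()
    cnv_positions = set()
    for row in csv.DictReader(handle):
        # Mirrors paste(data$cnv_550_g1, data$qlet, sep="_") in the R comments.
        participants.add(f"{row['cnv_550_g1']}_{row['qlet']}")
        cnv_positions.add(row["V1"])
    return len(participants), len(cnv_positions)


# Invented rows standing in for the real file contents.
demo = io.StringIO(
    "V1,cnv_550_g1,qlet\n"
    "chr1:100-200,1001,A\n"
    "chr1:100-200,1002,A\n"
    "chr2:300-900,1001,A\n"
    "chr3:50-80,1001,B\n"
)
result = count_participants_and_cnvs(demo)
print(result)  # (3, 3)
```

For a real file, the function would be called with an open file handle, for example `open(path, newline="")`.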
4. Imputed Data
4.1. Genome-wide - HRC imputed - G0 mothers + G1 (gi_hrc_g0m_g1)
SNP chips are useful for the generation of data on hundreds of thousands of SNPs, but there are millions more polymorphisms that remain untyped with this technology. If suitable numbers of whole genome sequences exist (e.g. 1000 genomes data) then millions of genotypes that are missing from a sample because they have not been typed by SNP chips can be imputed using probabilistic methods. Here the ALSPAC mother and child data were imputed to a new reference panel known as the Haplotype Reference Consortium (HRC) panel. This comprises around 31,000 sequenced individuals (mostly European), so the coverage of European haplotypes is much greater than in other panels. As a consequence, imputation accuracy is expected to improve, particularly at lower allele frequencies.
4.1.1. Description
This dataset contains genotype data imputed to HRC for G0 mothers and G1.
4.1.2. Methodology
ALSPAC children were genotyped using the Illumina HumanHap550 quad chip genotyping platform by 23andMe, subcontracting the Wellcome Trust Sanger Institute, Cambridge, UK and the Laboratory Corporation of America, Burlington, NC, US. The resulting raw genome-wide data were subjected to standard quality control methods. Individuals were excluded on the basis of gender mismatches; minimal or excessive heterozygosity; disproportionate levels of individual missingness (>3%) and insufficient sample replication (IBD < 0.8). Population stratification was assessed by multidimensional scaling analysis and compared with HapMap II (release 22) European descent (CEU), Han Chinese, Japanese and Yoruba reference populations; all individuals with non-European ancestry were removed. SNPs with a minor allele frequency of < 1%, a call rate of < 95% or evidence for violations of Hardy-Weinberg equilibrium (P < 5E-7) were removed. Cryptic relatedness was measured as proportion of identity by descent (IBD > 0.1). Related subjects that passed all other quality control thresholds were retained during subsequent phasing and imputation. 9,115 subjects and 500,527 SNPs passed these quality control filters.
ALSPAC mothers were genotyped using the Illumina human660W-quad array at Centre National de Génotypage (CNG) and genotypes were called with Illumina GenomeStudio. PLINK (v1.07) was used to carry out quality control measures on an initial set of 10,015 subjects and 557,124 directly genotyped SNPs. SNPs were removed if they displayed more than 5% missingness or a Hardy-Weinberg equilibrium P value of less than 1.0e-06. Additionally, SNPs with a minor allele frequency of less than 1% were removed. Samples were excluded if they displayed more than 5% missingness, had indeterminate X chromosome heterozygosity or extreme autosomal heterozygosity. Samples showing evidence of population stratification were identified by multidimensional scaling of genome-wide identity by state pairwise distances using the four HapMap populations as a reference, and then excluded. Cryptic relatedness was assessed using an IBD estimate of more than 0.125, which is expected to correspond to roughly 12.5% of alleles shared IBD or relatedness at the first-cousin level. Related subjects that passed all other quality control thresholds were retained during subsequent phasing and imputation. 9,048 subjects and 526,688 SNPs passed these quality control filters.
We combined 477,482 SNP genotypes in common between the sample of mothers and sample of children. We removed SNPs with genotype missingness above 1% due to poor quality (11,396 SNPs removed) and removed a further 321 subjects due to potential ID mismatches. This resulted in a dataset of 17,842 subjects containing 6,305 duos and 465,740 SNPs (112 were removed during liftover and 234 were out of HWE after combination). We estimated haplotypes using ShapeIT (v2.r644), which utilises relatedness during phasing. The phased haplotypes were then imputed to the Haplotype Reference Consortium (HRCr1.1, 2016) panel of approximately 31,000 phased whole genomes. The HRC panel was phased using ShapeIT v2.r727, and the imputation was performed using the Michigan imputation server.
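The counts in the combination step are internally consistent, as the short check below shows. It assumes that the subject total starts from the sum of the QC-passed mothers (9,048) and children (9,115) reported in the two preceding paragraphs, which is an inference from the numbers rather than something the text states.

```python
# SNPs: start from the overlap and subtract each reported removal.
common_snps = 477_482
removed_missingness = 11_396   # genotype missingness above 1%
removed_liftover = 112         # removed during liftover
removed_hwe = 234              # out of HWE after combination
final_snps = common_snps - removed_missingness - removed_liftover - removed_hwe

# Subjects: QC-passed mothers plus children, minus potential ID mismatches.
qc_passed_mothers = 9_048
qc_passed_children = 9_115
removed_id_mismatch = 321
final_subjects = qc_passed_mothers + qc_passed_children - removed_id_mismatch

print(final_snps, final_subjects)  # 465740 17842
```

Both values agree with the 465,740 SNPs and 17,842 subjects quoted above.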
This gave 8,237 eligible children and 8,196 eligible mothers with available genotype data after exclusion of related subjects using the cryptic relatedness measures described above.
Phasing parameters:
- States: 100
- Window: 2
- Effective population size: 11418
- Genetic map: 1000 genomes phase 1 version 3 (release date 21/05/2011)
- Burn in iterations: 7
- Pruning iterations: 8
- Main iterations: 20
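For readers who want to see how these parameters map onto a SHAPEIT v2 run, the sketch below assembles a command line from them. The option names are SHAPEIT v2 command-line flags, but the exact invocation used for this dataset is not given in this catalogue, and every file name here is a hypothetical placeholder.

```python
import shlex

# Values listed above, keyed by the corresponding SHAPEIT v2 option.
phasing_parameters = {
    "--states": 100,
    "--window": 2,
    "--effective-size": 11418,
    "--burn": 7,
    "--prune": 8,
    "--main": 20,
}


def build_shapeit_command(bed_prefix: str, genetic_map: str, out_prefix: str) -> list[str]:
    """Assemble (but do not run) a SHAPEIT v2 command; all paths are placeholders."""
    command = [
        "shapeit",
        "--input-bed", f"{bed_prefix}.bed", f"{bed_prefix}.bim", f"{bed_prefix}.fam",
        "--input-map", genetic_map,
        "--output-max", f"{out_prefix}.haps", f"{out_prefix}.sample",
    ]
    for flag, value in phasing_parameters.items():
        command += [flag, str(value)]
    return command


# Hypothetical file names, for illustration only.
command = build_shapeit_command("chr22", "genetic_map_chr22.txt", "chr22.phased")
print(shlex.join(command))
```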
4.1.3. Freeze Docs
# This yaml file is a description of a freeze of a released version of a named alspac dataset # It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema id: alspacdcs:gi_hrc_g0m_g1_2017-05-04_f3 name: >- Genome-wide - HRC imputed - G0 mothers + G1 version 2017-05-04 freeze 3 description: >- Freeze 3 of version 2017-05-04 Genome-wide array data imputed to the HRC reference panel for G0 mothers and G1 individuals in bgen and sample file format (version 1.2). freeze_size: 115G linker_file_md5sum: b528acad88cd1697129a7cd59aa14ada woc_file_md5sum: cf9249c306e766a8689f78197e1f5f25 all_individuals_to_exclude_md5sum: 7faad74aeebaba4ed71aac783414d75b git_tag: https://github.com/alspac/dataset_gi_hrc_g0m_g1/releases/tag/freeze3 is_current_freeze: true freeze_number: 3 freeze_date: 2023-09-13 previous_freeze: alspacdcs:gi_hrc_g0m_g1_2017-05-04_f2.1 next_freeze: freeze_of_alspac_dataset_version: alspacdcs:gi_hrc_g0m_g1_2017-05-04 freeze_of_named_alspac_dataset: alspacdcs:gi_hrc_g0m_g1 has_containers: - id: alspacdcs:f5eeb1f7-159b-4068-b876-b09d4864377c ## uuid name: data description: A dir/folder containing the freeze data bgen and .sample files has_parts: - id: alspacdcs:0209cfa4-2362-484a-922d-022bac8f1dc9 name: swapped_23_female data_distributions: - id: alspacdcs:6382dee1-05cc-48e7-badc-b44c7bd8cc42 name: swapped_23_female.sample md5sum: c2798a33724ef0b56889f36d36420aab filesize: 746.1KB filetype: .sample number_of_participants: 12948 belongs_to_container: alspacdcs:f5eeb1f7-159b-4068-b876-b09d4864377c - id: alspacdcs:138dceb1-a573-48fe-86bb-96e7e6e1ec15 name: filtered_17 data_distributions: - id: alspacdcs:87c528dc-6ac8-45e2-94d0-7f620ab9d3c4 name: filtered_17.bgen md5sum: a3e2c94bd25abbbb12b88b837a70b627 filesize: 3.6GB filetype: .bgen number_of_variants: 1090072 number_of_participants: 17450 belongs_to_container: alspacdcs:f5eeb1f7-159b-4068-b876-b09d4864377c - id: alspacdcs:42c56ef1-eba2-418f-b20a-a8a0d65544a5 name: filtered_11 
data_distributions: - id: alspacdcs:03ca58c5-d0ec-401f-9308-b6d0f8eb4107 name: filtered_11.bgen md5sum: 1091def1834d9df7a1250bd9e906771b filesize: 5.2GB filetype: .bgen number_of_variants: 1936990 number_of_participants: 17450 belongs_to_container: alspacdcs:f5eeb1f7-159b-4068-b876-b09d4864377c - id: alspacdcs:30076bbe-b1b1-4308-8f21-ef89e99fc649 name: filtered_23female data_distributions: - id: alspacdcs:049f1f5e-3982-4ad5-b820-36f154cd2309 name: filtered_23female.bgen md5sum: 6a2870bfad1f6dc302e30f348141c702 filesize: 4.2GB filetype: .bgen number_of_variants: 1228035 number_of_participants: 12948 belongs_to_container: alspacdcs:f5eeb1f7-159b-4068-b876-b09d4864377c - id: alspacdcs:12b615a9-c0dc-48b2-ad00-d0fb9555630c name: filtered_10 data_distributions: - id: alspacdcs:7fa6070c-e7a9-4a7e-9c3a-680be67ad3cc name: filtered_10.bgen md5sum: f918d645452701e313c73a497ce0a7d3 filesize: 5.1GB filetype: .bgen number_of_variants: 1927504 number_of_participants: 17450 belongs_to_container: alspacdcs:f5eeb1f7-159b-4068-b876-b09d4864377c - id: alspacdcs:4e3ffa23-5232-4a62-9253-cf1c8b77a850 name: filtered_16 data_distributions: - id: alspacdcs:b0a79594-0c96-4c8f-b800-489acebeef35 name: filtered_16.bgen md5sum: 06703f1d5b4ea054278c674f03c7fc99 filesize: 4.1GB filetype: .bgen number_of_variants: 1281298 number_of_participants: 17450 belongs_to_container: alspacdcs:f5eeb1f7-159b-4068-b876-b09d4864377c - id: alspacdcs:106c9401-9550-4d0f-b79a-58e556523876 name: filtered_12 data_distributions: - id: alspacdcs:13841df6-ed45-451e-b6a5-f33d19528c65 name: filtered_12.bgen md5sum: 9cb8162b978697ad6c7ac32073cdf30a filesize: 5.1GB filetype: .bgen number_of_variants: 1848118 number_of_participants: 17450 belongs_to_container: alspacdcs:f5eeb1f7-159b-4068-b876-b09d4864377c - id: alspacdcs:27eb64d7-9cfd-4a6d-9c9f-70b40e2b8fa7 name: filtered_08 data_distributions: - id: alspacdcs:72904e08-6d37-49e3-af93-3291c331779b name: filtered_08.bgen md5sum: 2bbc8cb921804dfd62c24d9b4250a179 filesize: 5.7GB 
filetype: .bgen number_of_variants: 2242706 number_of_participants: 17450 belongs_to_container: alspacdcs:f5eeb1f7-159b-4068-b876-b09d4864377c - id: alspacdcs:eae1d46d-3999-4a6f-a8fe-a5ab643f4c9b name: swapped data_distributions: - id: alspacdcs:ad8ac21f-60b5-4dbc-920c-e190048d0ec7 name: swapped.sample md5sum: 7d4874c35dd01c388c62bc4fc3ac1409 filesize: 1005.5KB filetype: .sample number_of_participants: 17450 belongs_to_container: alspacdcs:f5eeb1f7-159b-4068-b876-b09d4864377c - id: alspacdcs:e1fa5fe7-a9e5-4e99-a401-602b882e01ed name: filtered_04 data_distributions: - id: alspacdcs:6b41585e-94bb-483b-aa6c-c2784db43bad name: filtered_04.bgen md5sum: a548f469f6883f4b1800a9f5af485731 filesize: 7.9GB filetype: .bgen number_of_variants: 2787582 number_of_participants: 17450 belongs_to_container: alspacdcs:f5eeb1f7-159b-4068-b876-b09d4864377c - id: alspacdcs:ccc93888-421b-4196-9a45-ed77094fb288 name: filtered_23male data_distributions: - id: alspacdcs:25ce4f1a-01f1-4d1e-bf82-6e662bdb7560 name: filtered_23male.bgen md5sum: 3724d6e36f9e29bfaadeaf28cf3bff19 filesize: 1.2GB filetype: .bgen number_of_variants: 1228035 number_of_participants: 4502 belongs_to_container: alspacdcs:f5eeb1f7-159b-4068-b876-b09d4864377c - id: alspacdcs:f82ab219-40ad-479d-a81a-b6884e62ec5f name: filtered_05 data_distributions: - id: alspacdcs:251df0c3-924b-4bdf-b17f-d161abcf410b name: filtered_05.bgen md5sum: b2cc0285b722d34a91beae96a0336021 filesize: 6.7GB filetype: .bgen number_of_variants: 2588170 number_of_participants: 17450 belongs_to_container: alspacdcs:f5eeb1f7-159b-4068-b876-b09d4864377c - id: alspacdcs:5a4281e6-6a66-4d67-ab41-bb3ad8b28a70 name: filtered_19 data_distributions: - id: alspacdcs:034b00e1-b895-43df-87c5-4eb50534134d name: filtered_19.bgen md5sum: 57c941571a65f230946b6cceac9389c5 filesize: 3.4GB filetype: .bgen number_of_variants: 868554 number_of_participants: 17450 belongs_to_container: alspacdcs:f5eeb1f7-159b-4068-b876-b09d4864377c - id: 
alspacdcs:204cc6dc-3d90-4206-8311-cc0d13c426ce name: filtered_15 data_distributions: - id: alspacdcs:45cf7a62-b9fd-4dc6-96e9-d141092fee95 name: filtered_15.bgen md5sum: f667e6a2c995a6d0e34c2088c46a73aa filesize: 3.4GB filetype: .bgen number_of_variants: 1139215 number_of_participants: 17450 belongs_to_container: alspacdcs:f5eeb1f7-159b-4068-b876-b09d4864377c - id: alspacdcs:50b32525-c314-45aa-8e9b-13aa334e4829 name: filtered_06 data_distributions: - id: alspacdcs:d54f5f9d-4a18-4578-b583-8ee442c232d0 name: filtered_06.bgen md5sum: ca6e82024a0967dfbabef45f0ee0a36a filesize: 6.3GB filetype: .bgen number_of_variants: 2460112 number_of_participants: 17450 belongs_to_container: alspacdcs:f5eeb1f7-159b-4068-b876-b09d4864377c - id: alspacdcs:6492b05b-c96e-4212-a9df-66fd8166a303 name: filtered_07 data_distributions: - id: alspacdcs:302a2770-064d-47c6-89c7-486bae0885de name: filtered_07.bgen md5sum: 6e1a041862675c915607fba05d2921d9 filesize: 6.6GB filetype: .bgen number_of_variants: 2289306 number_of_participants: 17450 belongs_to_container: alspacdcs:f5eeb1f7-159b-4068-b876-b09d4864377c - id: alspacdcs:190f6eca-5f94-4612-b21b-5ac29c96ed8a name: filtered_09 data_distributions: - id: alspacdcs:1634ab85-cff8-4c22-8ffa-a7132473d6c1 name: filtered_09.bgen md5sum: 95dc66c33bae79e5e0e37a165d10382c filesize: 4.5GB filetype: .bgen number_of_variants: 1675899 number_of_participants: 17450 belongs_to_container: alspacdcs:f5eeb1f7-159b-4068-b876-b09d4864377c - id: alspacdcs:81eea65d-58a4-4761-8718-53d41fce694f name: filtered_01 data_distributions: - id: alspacdcs:576d1c2c-bbd5-4ea6-8979-9ff0a1bb8d64 name: filtered_01.bgen md5sum: 761d9e2114870f7d3e7dfe5c94b7ce47 filesize: 8.6GB filetype: .bgen number_of_variants: 3069932 number_of_participants: 17450 belongs_to_container: alspacdcs:f5eeb1f7-159b-4068-b876-b09d4864377c - id: alspacdcs:d6abd7c5-0094-4af7-9bcc-a01a43bf819c name: filtered_20 data_distributions: - id: alspacdcs:3915759f-bb8c-4605-aca8-211381ca2ecc name: filtered_20.bgen 
md5sum: 7f6feb66e94a43ccc17dd9cee4246885 filesize: 2.6GB filetype: .bgen number_of_variants: 884983 number_of_participants: 17450 belongs_to_container: alspacdcs:f5eeb1f7-159b-4068-b876-b09d4864377c - id: alspacdcs:98612003-9a2a-41df-b5e5-52df68a5509e name: filtered_18 data_distributions: - id: alspacdcs:685b818e-5c55-4c28-8d79-5bab676d34f4 name: filtered_18.bgen md5sum: d9a36ab3d148e15190279503d17c67e6 filesize: 3.1GB filetype: .bgen number_of_variants: 1104755 number_of_participants: 17450 belongs_to_container: alspacdcs:f5eeb1f7-159b-4068-b876-b09d4864377c - id: alspacdcs:8ea74f9e-062c-4845-98a9-c8fc8ad4bb32 name: filtered_03 data_distributions: - id: alspacdcs:69234c86-7086-457d-9c69-71568aaef7e1 name: filtered_03.bgen md5sum: 38813c0a552aa33edb2b4eabb99dae61 filesize: 7.3GB filetype: .bgen number_of_variants: 2821895 number_of_participants: 17450 belongs_to_container: alspacdcs:f5eeb1f7-159b-4068-b876-b09d4864377c - id: alspacdcs:669682ff-de83-463d-a6b3-77ccecbe7668 name: swapped_23_male data_distributions: - id: alspacdcs:3621a305-27b0-4637-acdf-018ab3b95127 name: swapped_23_male.sample md5sum: 9e3d7be6d73c999c3c55a0a0dbdfa468 filesize: 259.5KB filetype: .sample number_of_participants: 4502 belongs_to_container: alspacdcs:f5eeb1f7-159b-4068-b876-b09d4864377c - id: alspacdcs:047161b7-af54-4314-986b-92001ccb0d5c name: filtered_21 data_distributions: - id: alspacdcs:9baef6f9-86d8-4fed-be4f-051da7f0f739 name: filtered_21.bgen md5sum: 222f20b8fb823008e4a66cb9df14ee25 filesize: 1.7GB filetype: .bgen number_of_variants: 531276 number_of_participants: 17450 belongs_to_container: alspacdcs:f5eeb1f7-159b-4068-b876-b09d4864377c - id: alspacdcs:65f75f83-ad4b-4b79-b27d-5121a15625a1 name: filtered_13 data_distributions: - id: alspacdcs:e3bdf2f7-2d7f-4404-82ee-2ff785c52d08 name: filtered_13.bgen md5sum: df09412f808737ced2e275a3bec68a75 filesize: 3.7GB filetype: .bgen number_of_variants: 1385434 number_of_participants: 17450 belongs_to_container: 
alspacdcs:f5eeb1f7-159b-4068-b876-b09d4864377c - id: alspacdcs:4c00ce92-d7e7-47c6-a9a8-455fda77b5ef name: filtered_14 data_distributions: - id: alspacdcs:f2442b3d-44b3-4478-a447-1e59b965ae48 name: filtered_14.bgen md5sum: 176cb3af44e9a771d9909f4a0a7219f2 filesize: 3.5GB filetype: .bgen number_of_variants: 1266536 number_of_participants: 17450 belongs_to_container: alspacdcs:f5eeb1f7-159b-4068-b876-b09d4864377c - id: alspacdcs:6980c0da-8698-4a0c-bff9-b03dea47369d name: filtered_02 data_distributions: - id: alspacdcs:763cd279-809d-45ab-855b-e2492160df9c name: filtered_02.bgen md5sum: bd3acd8e0c90cf94bd0d0e889b317591 filesize: 8.7GB filetype: .bgen number_of_variants: 3392238 number_of_participants: 17450 belongs_to_container: alspacdcs:f5eeb1f7-159b-4068-b876-b09d4864377c - id: alspacdcs:72a3efb1-bc9d-4cea-ad47-11c602c4ab37 name: filtered_22 data_distributions: - id: alspacdcs:cd0cd547-c990-48c5-b52f-937e4f9f0678 name: filtered_22.bgen md5sum: 15d8c159b8a3a01ebb16db4f915f52c5 filesize: 1.8GB filetype: .bgen number_of_variants: 524544 number_of_participants: 17450 belongs_to_container: alspacdcs:f5eeb1f7-159b-4068-b876-b09d4864377c
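Each file in the freeze description above carries an md5sum, which can be used to confirm that a received copy is intact. The following is a minimal sketch using only the Python standard library; the function names and the example path are assumptions, not part of any ALSPAC tooling.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading in chunks to bound memory use."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_catalogue(path: Path, expected_md5: str) -> bool:
    """Compare a file's digest with the md5sum recorded in the freeze description."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

For example, `matches_catalogue(Path("data/filtered_22.bgen"), "15d8c159b8a3a01ebb16db4f915f52c5")` would check the chromosome 22 file, assuming it has been placed at that (hypothetical) location.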
4.2. Genome-wide - HapMap2 imputed - G1 (gi_hapmap2_g1)
4.2.1. Description
This dataset contains genotype data imputed to HapMap 2 for G1.
4.2.2. Methodology
A total of 9,912 subjects were genotyped using the Illumina HumanHap550 quad genome-wide SNP genotyping platform by 23andMe, subcontracting the Wellcome Trust Sanger Institute, Cambridge, UK, and the Laboratory Corporation of America, Burlington, NC, USA. Individuals were excluded from further analysis on the basis of incorrect gender assignments; minimal or excessive heterozygosity (<0.320 or >0.345 for the Sanger data, and <0.310 or >0.330 for the LabCorp data); disproportionate levels of individual missingness (>3%); evidence of cryptic relatedness (>10% IBD); or non-European ancestry, as detected by a multidimensional scaling analysis seeded with HapMap 2 individuals. EIGENSTRAT analysis revealed no additional obvious population stratification, and genome-wide analyses with other phenotypes indicate a low lambda. The resulting data set consisted of 8,365 individuals (84% of those genotyped). SNPs with a minor allele frequency of <1% or a call rate of <95% were removed. Furthermore, only SNPs that passed an exact test of Hardy-Weinberg equilibrium (P > 5 x 10^-7) were considered for analysis. Genotypes were subsequently imputed with the MACH 1.0.16 Markov Chain Haplotyping software, using CEPH individuals from phase 2 of the HapMap project as a reference set (release 22).
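Because the heterozygosity bounds differ between the two genotyping centres, the same sample can pass at one and fail at the other. The sketch below encodes the numeric sample-level criteria from this paragraph; it deliberately omits the gender and ancestry checks, and the function and argument names are hypothetical.

```python
# Heterozygosity bounds per genotyping centre, as quoted above.
HET_BOUNDS = {"sanger": (0.320, 0.345), "labcorp": (0.310, 0.330)}
MAX_SAMPLE_MISSINGNESS = 0.03  # individual missingness above 3% is excluded
MAX_IBD = 0.10                 # IBD sharing above 10% counts as cryptic relatedness


def exclusion_reasons(centre: str, heterozygosity: float,
                      missingness: float, max_ibd: float) -> list[str]:
    """List the numeric criteria a sample fails (empty list means it is retained)."""
    reasons = []
    low, high = HET_BOUNDS[centre]
    if heterozygosity < low or heterozygosity > high:
        reasons.append("heterozygosity")
    if missingness > MAX_SAMPLE_MISSINGNESS:
        reasons.append("missingness")
    if max_ibd > MAX_IBD:
        reasons.append("cryptic relatedness")
    return reasons


# Made-up samples: a heterozygosity of 0.335 is acceptable for Sanger data
# but excessive for LabCorp data.
print(exclusion_reasons("sanger", 0.335, 0.01, 0.02))
print(exclusion_reasons("labcorp", 0.335, 0.04, 0.50))
```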
Associated publication: https://doi.org/10.1093/hmg/ddr309
4.2.3. Freeze Docs
# This yaml file is a description of a freeze of a released version of a named alspac dataset
# It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema
id: alspacdcs:gi_hapmap2_g1_2022-12-07_f3
name: Genome-wide - HapMap2 imputed - G1 version 2022-12-07 freeze 3
description: >-
  Freeze 3 of 2022-12-07 version of Genome-wide array data imputed to the HapMap2 reference panel for G1 individuals
freeze_size: 5G
linker_file_md5sum: b528acad88cd1697129a7cd59aa14ada
woc_file_md5sum: cf9249c306e766a8689f78197e1f5f25
all_individuals_to_exclude_md5sum: 7faad74aeebaba4ed71aac783414d75b
git_tag: https://github.com/alspac/dataset_gi_hapmap2_g1/releases/tag/freeze3
is_current_freeze: true
freeze_number: 3
freeze_date: 2023-09-13
previous_freeze: alspacdcs:gi_hapmap2_g1_2022-12-07_f2
next_freeze:
freeze_of_alspac_dataset_version: alspacdcs:gi_hapmap2_g1_2022-12-07
freeze_of_named_alspac_dataset: alspacdcs:gi_hapmap2_g1
has_containers:
  - id: alspacdcs:63f74523-9ddd-4cc7-9037-d166bd1edba9
    name: data
    description: A dir/folder containing the plink freeze data files
has_parts:
  - id: alspacdcs:1219c7c8-6a5b-482a-bdfc-2f26a4df5885
    name: freeze_id
    data_distributions:
      - id: alspacdcs:60f60918-c307-4350-a88e-1138c360a72b
        name: freeze_id.fam
        md5sum: 0b42e5ffecb0fef6ed4702d1932eb424
        filesize: 274KB
        filetype: .fam
        number_of_participants: 8224
    belongs_to_container: alspacdcs:63f74523-9ddd-4cc7-9037-d166bd1edba9
  - id: alspacdcs:dd090263-6041-48f4-a28c-af81ff709614
    name: freeze_id
    data_distributions:
      - id: alspacdcs:25a9b8c3-0426-4cfb-b916-d6807c36fc50
        name: freeze_id.bim
        md5sum: 854d50582220c70ae5645b1a1c799af1
        filesize: 68MB
        filetype: .bim
        number_of_variants: 2543887
    belongs_to_container: alspacdcs:63f74523-9ddd-4cc7-9037-d166bd1edba9
  - id: alspacdcs:8b10969b-0e48-4c27-8e9d-efc812042839
    name: freeze_id
    data_distributions:
      - id: alspacdcs:b2436f3c-fbd7-4470-ba44-dac809c767a3
        name: freeze_id.bed
        md5sum: 90f19b52657a4fff8b301efbd87ea057
        filesize: 4.9GB
        filetype: .bed
        number_of_variants: 2543887
        number_of_participants: 8224
    belongs_to_container: alspacdcs:63f74523-9ddd-4cc7-9037-d166bd1edba9
  - id: alspacdcs:260ae08f-7077-4fa6-b59d-e1249afddbcf
    name: freeze_id
    data_distributions:
      - id: alspacdcs:3313fbd3-cd49-4c77-a1c6-5cc48d17fdbb
        name: freeze_id.log
        md5sum: 01f4f9e9e96f38d23fc50f386fd9a081
        filesize: 958B
        filetype: .log
    belongs_to_container: alspacdcs:63f74523-9ddd-4cc7-9037-d166bd1edba9
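The participant and variant counts recorded for freeze_id.bed can be cross-checked against its file size. A PLINK 1 binary .bed file in variant-major layout has a 3-byte header and stores each genotype in 2 bits, with every variant padded to a whole number of bytes. The sketch below applies that rule to the counts above; reading the catalogue's "GB" as binary gigabytes (GiB) is an assumption.

```python
import math

PLINK_BED_HEADER_BYTES = 3  # fixed header of a PLINK 1 binary .bed file


def expected_bed_size_bytes(n_variants: int, n_samples: int) -> int:
    """Size of a variant-major .bed file: 2 bits per genotype, padded per variant."""
    bytes_per_variant = math.ceil(n_samples / 4)
    return PLINK_BED_HEADER_BYTES + n_variants * bytes_per_variant


size = expected_bed_size_bytes(n_variants=2_543_887, n_samples=8_224)
print(size)                      # 5230231675
print(round(size / 1024**3, 1))  # 4.9
```

Under that reading, the result agrees with the 4.9GB filesize listed for freeze_id.bed.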
4.3. Genome-wide - HapMap2 imputed - G0 mothers (gi_hapmap2_g0m)
4.3.1. Description
This dataset contains genotype data imputed to HapMap 2 for G0 mothers.
4.3.2. Methodology
A total of 10,015 women (mothers from the ALSPAC cohort) were genotyped using the Illumina 660 quad SNP chip, which contains 557,124 SNP markers. Markers with a minor allele frequency <1%, SNPs with >5% missing genotypes, and any markers that failed an exact test of Hardy-Weinberg equilibrium (P < 1 x 10^-6) were excluded from further analyses. Genome-wide identity by state (IBS) sharing was calculated for each pair of individuals in the cohort to identify cryptic relatedness. To identify individuals who might have ancestries other than Western European, we merged data from both cohorts with the 60 Western European (CEU) founder, 60 Nigerian (YRI) founder, and 90 Japanese (JPT) and Han Chinese (CHB) individuals from the International HapMap Project. Genome-wide IBS distances for each pair of individuals were calculated on markers shared between HapMap and the Illumina 660K SNP chip, and the multidimensional scaling option in R was then used to generate a two-dimensional plot based on individuals' scores on the first two principal coordinates from this analysis. Samples that did not cluster with the CEU individuals were excluded from subsequent analyses. In addition, we plotted the proportion of missing data for each individual against their genome-wide heterozygosity. Any individual who did not cluster with the others was removed from further analyses. Samples were also excluded in the case of excessive missingness (>5%) or unusual genome-wide or X chromosome heterozygosity, and one individual was removed from each pair of putatively related individuals (genome-wide IBD >10%). After data cleaning, 8,340 individuals and 526,688 SNPs were left in the genome-wide data set.
We then conducted imputation using the MACH Markov Chain Haplotyping software, with CEU individuals from phase 2 of the HapMap project as a reference set (release 22). The final imputed data set consisted of 8,340 individuals, each with 2,594,390 imputed markers. Only imputed genotypes with minor allele frequencies ≥1% and R-squared ≥0.3 were considered for association. Of these 8,340 individuals with genetic data, 2,874 mothers also had phenotype data available.
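The post-imputation filter can be illustrated with allele dosages: the frequency of the counted allele is the mean dosage divided by two, and the minor allele frequency is the smaller of that value and its complement. The sketch below combines this with the imputation R-squared cut-off quoted above; the function names and the dosage values are made up for illustration.

```python
MIN_MAF = 0.01  # minor allele frequency of at least 1%
MIN_RSQ = 0.3   # imputation R-squared of at least 0.3


def minor_allele_frequency(dosages: list[float]) -> float:
    """MAF from per-individual dosages of one allele (each between 0 and 2)."""
    allele_freq = sum(dosages) / (2 * len(dosages))
    return min(allele_freq, 1 - allele_freq)


def keep_for_association(dosages: list[float], rsq: float) -> bool:
    """Apply both thresholds to a single imputed marker."""
    return minor_allele_frequency(dosages) >= MIN_MAF and rsq >= MIN_RSQ


common_marker = [0.0, 1.0, 2.0, 1.0]  # allele frequency 0.5
rare_marker = [2.0] * 99 + [1.0]      # counted allele at 0.995, so MAF is about 0.005
print(keep_for_association(common_marker, rsq=0.9))  # True
print(keep_for_association(common_marker, rsq=0.2))  # False
print(keep_for_association(rare_marker, rsq=0.9))    # False
```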
Associated publication: https://doi.org/10.1093/hmg/ddt239
4.3.3. Freeze Docs
# This yaml file is a description of a freeze of a released version of a named alspac dataset # It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema id: alspacdcs:gi_hapmap2_g0m_2022-12-07_f3 name: Genome-wide - HapMap2 imputed - G0 mothers version 2022-12-07 freeze 3 description: >- Version 2022-12-07 freeze 3 of Genome-wide array data imputed to the HapMap2 reference panel for G0 mothers. The number of variants & individuals within each plink file set can be viewed within the log file. freeze_size: 4.9G linker_file_md5sum: b528acad88cd1697129a7cd59aa14ada woc_file_md5sum: cf9249c306e766a8689f78197e1f5f25 all_individuals_to_exclude_md5sum: 7faad74aeebaba4ed71aac783414d75b git_tag: https://github.com/alspac/dataset_gi_hapmap2_g0m/releases/tag/freeze3 is_current_freeze: true freeze_number: 3 freeze_date: 2023-09-13 previous_freeze: alspacdcs:gi_hapmap2_g0m_2022-12-07_f2 freeze_of_alspac_dataset_version: alspacdcs:gi_hapmap2_g0m_2022-12-07 freeze_of_named_alspac_dataset: alspacdcs:gi_hapmap2_g0m has_containers: - id: alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6 ## uuid name: plink description: A dir/folder containing the plink freeze data files. There are 8123 individuals within this dataset. 
has_parts: - id: alspacdcs:2a0c29b0-c0d3-4b42-a1af-ca050ffc4c69 name: freeze_id_chr7 data_distributions: - id: alspacdcs:a338bb73-c267-4744-87ec-445574c38a7b name: freeze_id_chr7.fam md5sum: 67185256052f381f1dc8c8eb3c1b18d2 filesize: 277.6KB filetype: .fam belongs_to_container: alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6 - id: alspacdcs:cf60a483-992a-4288-a209-7bb8e475c232 name: freeze_id_chr16 data_distributions: - id: alspacdcs:1e666615-f5f3-48e8-844a-947deb910273 name: freeze_id_chr16.fam md5sum: 67185256052f381f1dc8c8eb3c1b18d2 filesize: 277.6KB filetype: .fam belongs_to_container: alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6 - id: alspacdcs:eee7a728-b4c1-4dba-8895-92945ee1d5da name: freeze_id_chr6 data_distributions: - id: alspacdcs:c77a7f1f-f262-4c69-b5be-875817dc987f name: freeze_id_chr6.log md5sum: 1f402fdbae9972ac3de162d17e07686f filesize: 988.0B filetype: .log belongs_to_container: alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6 - id: alspacdcs:f1a9a04d-cffc-4ec6-8bde-c7dbd52d11db name: freeze_id_chr20 data_distributions: - id: alspacdcs:c6d97d9a-7ac8-49a4-b2bb-433de0913204 name: freeze_id_chr20.log md5sum: 75a0270192c017296773f508784bf94a filesize: 992.0B filetype: .log belongs_to_container: alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6 - id: alspacdcs:d0212c8a-b2cd-42d7-9023-9d379588e522 name: freeze_id_chr5 data_distributions: - id: alspacdcs:941aaf89-6286-455e-b980-a557d2a3d940 name: freeze_id_chr5.log md5sum: 9f55daab5fb6e209e7276744e454a9e7 filesize: 988.0B filetype: .log belongs_to_container: alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6 - id: alspacdcs:c559f04b-256f-4157-ba60-acb657915425 name: freeze_id_chr5 data_distributions: - id: alspacdcs:56d2a337-61a1-497b-a05a-bbf455e3c6ae name: freeze_id_chr5.bim md5sum: a54c2542a07c18c0303c29b4f34a107e filesize: 4.4MB filetype: .bim belongs_to_container: alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6 - id: alspacdcs:0e403787-f35f-4d35-a40d-baf78c4ef900 name: freeze_id_chr20 data_distributions: - 
id: alspacdcs:dc1dd187-987d-4fc4-94e8-14d91446f22b name: freeze_id_chr20.fam md5sum: 67185256052f381f1dc8c8eb3c1b18d2 filesize: 277.6KB filetype: .fam belongs_to_container: alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6 - id: alspacdcs:6568b5c1-83e7-4c2f-8b05-3aad674a6cc2 name: freeze_id_chr6 data_distributions: - id: alspacdcs:eddd9880-45f2-44c8-a9ae-720adf2995b8 name: freeze_id_chr6.bed md5sum: 684a7aed0e2b84bed80fd59175adb083 filesize: 353.3MB filetype: .bed belongs_to_container: alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6 - id: alspacdcs:3042183d-897a-434b-939f-7fcda032740e name: freeze_id_chr21 data_distributions: - id: alspacdcs:efd7e847-bd17-4f16-a25b-eee3ad71574a name: freeze_id_chr21.fam md5sum: 67185256052f381f1dc8c8eb3c1b18d2 filesize: 277.6KB filetype: .fam belongs_to_container: alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6 - id: alspacdcs:2ad71f36-75c6-4862-a6e2-b1405fbffec6 name: freeze_id_chr15 data_distributions: - id: alspacdcs:b6e9d45e-b3a6-42bb-b93e-961b6332b74b name: freeze_id_chr15.bim md5sum: 5a6ef9aa0087da88eef574bc16d2a359 filesize: 1.9MB filetype: .bim belongs_to_container: alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6 - id: alspacdcs:54626420-1dda-48a7-a817-eba0ce443baa name: freeze_id_chr1 data_distributions: - id: alspacdcs:b14f3adb-fca2-4b2f-b86f-f256ac90498c name: freeze_id_chr1.fam md5sum: 67185256052f381f1dc8c8eb3c1b18d2 filesize: 277.6KB filetype: .fam belongs_to_container: alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6 - id: alspacdcs:1a28b5e2-f9bf-4a18-90e2-b21e42847d73 name: freeze_id_chr3 data_distributions: - id: alspacdcs:9203dd2c-1a3e-453b-bb75-9aeea5f6fff0 name: freeze_id_chr3.bed md5sum: 3c516f3dc65bef87665fa1d4a9293a2a filesize: 337.7MB filetype: .bed belongs_to_container: alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6 - id: alspacdcs:aa843393-d65c-42b0-a668-ae82db2bfa5a name: freeze_id_chr1 data_distributions: - id: alspacdcs:c4e968a3-098e-4c30-af50-2a6c135516b4 name: freeze_id_chr1.bed md5sum: 
281cd2702a11d797a9465d63d80a3fc9 filesize: 374.9MB filetype: .bed belongs_to_container: alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6 - id: alspacdcs:1fa3e3a1-d145-4424-adce-439a23e39f23 name: freeze_id_chr18 data_distributions: - id: alspacdcs:fe71b45e-244e-4dc3-8e76-2c843a9e41fc name: freeze_id_chr18.bim md5sum: b2c3323ec24277ecff0b221f6d2795e3 filesize: 2.1MB filetype: .bim belongs_to_container: alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6 - id: alspacdcs:5e3b344a-9a0b-4920-80da-c228b7ea9e69 name: freeze_id_chr12 data_distributions: - id: alspacdcs:ab775cd6-55ad-42b6-a6e4-b56b51f90ecd name: freeze_id_chr12.fam md5sum: 67185256052f381f1dc8c8eb3c1b18d2 filesize: 277.6KB filetype: .fam belongs_to_container: alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6 - id: alspacdcs:d9fd5026-873b-4338-9392-00d550549aad name: freeze_id_chr4 data_distributions: - id: alspacdcs:03fab819-6746-44e4-9a8e-b2a0765eb485 name: freeze_id_chr4.bed md5sum: e3d259ed4e229f0b0d82933e1e3e9c00 filesize: 316.0MB filetype: .bed belongs_to_container: alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6 - id: alspacdcs:be3078e0-da76-40f9-99ee-20d87dfc9a1b name: freeze_id_chr12 data_distributions: - id: alspacdcs:f7f50212-6814-431c-9839-41aa2adc3e70 name: freeze_id_chr12.bed md5sum: d5472422c20fb1bf4a4dd7d54e55ef72 filesize: 241.8MB filetype: .bed belongs_to_container: alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6 - id: alspacdcs:eaaf914b-2f14-4b45-9d9e-34fd676365e0 name: freeze_id_chr2 data_distributions: - id: alspacdcs:f5d7d497-eb03-42c7-a976-528e384b9325 name: freeze_id_chr2.log md5sum: 0a49fbf307da5c8d72504241c9604685 filesize: 988.0B filetype: .log belongs_to_container: alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6 - id: alspacdcs:d1a2426b-701a-47d4-b294-70c7983ab615 name: freeze_id_chr2 data_distributions: - id: alspacdcs:2f6be232-c686-405d-b94d-e28536bc2bbe name: freeze_id_chr2.fam md5sum: 67185256052f381f1dc8c8eb3c1b18d2 filesize: 277.6KB filetype: .fam belongs_to_container: 
alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6 - id: alspacdcs:c6645029-416d-4773-97fd-d4cf5b8efb75 name: freeze_id_chr18 data_distributions: - id: alspacdcs:db741e70-cde2-41e9-b03c-f5fe07b43403 name: freeze_id_chr18.log md5sum: 37dc1ac70a199c45688c54a50dbfe62a filesize: 992.0B filetype: .log belongs_to_container: alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6 - id: alspacdcs:f80d1aae-a683-474e-a817-4a10bcb1e403 name: freeze_id_chr20 data_distributions: - id: alspacdcs:6ef7c246-cad3-440b-ab48-6ac5576d6071 name: freeze_id_chr20.bed md5sum: bff6128829dcce99eaa87ce643b53894 filesize: 122.8MB filetype: .bed belongs_to_container: alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6 - id: alspacdcs:801b6dc1-acf3-4531-a343-23ae06f84fac name: freeze_id_chr4 data_distributions: - id: alspacdcs:fd01c3e0-5fcc-4aae-8396-90e950896735 name: freeze_id_chr4.log md5sum: c13619d1640920bff07d3c7a59c971d3 filesize: 988.0B filetype: .log belongs_to_container: alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6 - id: alspacdcs:2689a053-ef8e-4c06-b6b1-61105f4026fe name: freeze_id_chr10 data_distributions: - id: alspacdcs:360e890a-738d-40f6-bbb0-2012c2125ac9 name: freeze_id_chr10.log md5sum: c4ccca8e64f12bfcf15a7e70be1a3f06 filesize: 994.0B filetype: .log belongs_to_container: alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6 - id: alspacdcs:c74b56df-5025-4a2e-86d5-66d21fd23ffe name: freeze_id_chr14 data_distributions: - id: alspacdcs:884070a5-a2d1-4ec1-8bef-e7e5a79567d8 name: freeze_id_chr14.fam md5sum: 67185256052f381f1dc8c8eb3c1b18d2 filesize: 277.6KB filetype: .fam belongs_to_container: alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6 - id: alspacdcs:646350e7-af4f-4521-86dc-4efb78d78b39 name: freeze_id_chr17 data_distributions: - id: alspacdcs:df2cd9cd-bf28-499a-9de2-26f5676e6e4e name: freeze_id_chr17.fam md5sum: 67185256052f381f1dc8c8eb3c1b18d2 filesize: 277.6KB filetype: .fam belongs_to_container: alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6 - id: alspacdcs:456a3a2c-5452-4147-8a44-d8ce22b2cced 
name: freeze_id_chr8 data_distributions: - id: alspacdcs:a376176f-c1d6-4ce1-895d-54a1b3a2c12a name: freeze_id_chr8.fam md5sum: 67185256052f381f1dc8c8eb3c1b18d2 filesize: 277.6KB filetype: .fam belongs_to_container: alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6 - id: alspacdcs:bcc01530-fe95-4c6f-98c1-768a0f9345bf name: freeze_id_chr9 data_distributions: - id: alspacdcs:f216feb5-a5ed-4496-9333-74ea7bdf2f9f name: freeze_id_chr9.bed md5sum: a20184b5477ec3292a7d321404191751 filesize: 236.5MB filetype: .bed belongs_to_container: alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6 - id: alspacdcs:68b6e7cc-1655-4858-87ea-16a8c1afbe94 name: freeze_id_chr6 data_distributions: - id: alspacdcs:269b45c4-98f5-43d5-abe8-5dd7bf9a1183 name: freeze_id_chr6.fam md5sum: 67185256052f381f1dc8c8eb3c1b18d2 filesize: 277.6KB filetype: .fam belongs_to_container: alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6 - id: alspacdcs:f8b35cf8-ccd4-42ad-937c-f69109cfd02d name: freeze_id_chr12 data_distributions: - id: alspacdcs:49c411bc-2bb5-4d82-8b42-a625660e1695 name: freeze_id_chr12.log md5sum: a9905ea972946b7522b872a8103df342 filesize: 994.0B filetype: .log belongs_to_container: alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6 - id: alspacdcs:91755239-994d-4458-92f2-cd306ffd4278 name: freeze_id_chr4 data_distributions: - id: alspacdcs:3d9d591e-aff0-4156-bdf9-6446d48d66af name: freeze_id_chr4.fam md5sum: 67185256052f381f1dc8c8eb3c1b18d2 filesize: 277.6KB filetype: .fam belongs_to_container: alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6 - id: alspacdcs:631dc001-27ef-49ff-94c2-918003a62bc9 name: freeze_id_chr19 data_distributions: - id: alspacdcs:a933b3eb-a440-4d39-bd5f-ddefbbfa219e name: freeze_id_chr19.bim md5sum: eb33f90e6c4f631aa9683c2fff5e75f7 filesize: 1012.3KB filetype: .bim belongs_to_container: alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6 - id: alspacdcs:3ed7b96a-da6e-4f83-abc3-6470482eff59 name: freeze_id_chr20 data_distributions: - id: alspacdcs:3f0a17a7-52d7-4125-bd48-1b84ff359d53 name: 
freeze_id_chr20.bim md5sum: 2839fc89dc4e7a0b88107ed8c7d848b5 filesize: 1.7MB filetype: .bim belongs_to_container: alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6 - id: alspacdcs:fb01e5fb-2135-41d4-a81b-fb7e9528ff46 name: freeze_id_chr21 data_distributions: - id: alspacdcs:52fd64f9-d143-4c05-b528-406b7bfbb14c name: freeze_id_chr21.bim md5sum: b9b6bcad51c47f94a6b42c9b564c48b3 filesize: 924.7KB filetype: .bim belongs_to_container: alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6 - id: alspacdcs:488d784f-9ad8-4d35-8066-7b9d219bc1a9 name: freeze_id_chr13 data_distributions: - id: alspacdcs:eb0c1deb-d2b8-4117-9d72-99560c96e9f1 name: freeze_id_chr13.fam md5sum: 67185256052f381f1dc8c8eb3c1b18d2 filesize: 277.6KB filetype: .fam belongs_to_container: alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6 - id: alspacdcs:a26378b3-d3b4-4310-bc08-9daa0ba284f8 name: freeze_id_chr15 data_distributions: - id: alspacdcs:9d991a05-1e1f-4e5b-b4fd-937c8ed462b9 name: freeze_id_chr15.fam md5sum: 67185256052f381f1dc8c8eb3c1b18d2 filesize: 277.6KB filetype: .fam belongs_to_container: alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6 - id: alspacdcs:443a8575-72af-46c3-9c88-5aa01799ac92 name: freeze_id_chr11 data_distributions: - id: alspacdcs:3b0092d3-8515-4a5a-a750-9a6725d959f8 name: freeze_id_chr11.fam md5sum: 67185256052f381f1dc8c8eb3c1b18d2 filesize: 277.6KB filetype: .fam belongs_to_container: alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6 - id: alspacdcs:fdb290af-3470-4bb7-83a2-1df1a812517e name: freeze_id_chr14 data_distributions: - id: alspacdcs:ef876dc3-698b-4a5b-a9d2-526644b5d615 name: freeze_id_chr14.bim md5sum: e6098ee196c0fb7b5db26cbd31bcb096 filesize: 2.3MB filetype: .bim belongs_to_container: alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6 - id: alspacdcs:e4b2ec7a-0fee-47ea-b592-61edc0af2eb5 name: freeze_id_chr2 data_distributions: - id: alspacdcs:590d3a80-352e-47f8-84b9-11958e5364e4 name: freeze_id_chr2.bed md5sum: d6de77dc17d5eb8892b203171d0ced48 filesize: 427.7MB filetype: .bed 
belongs_to_container: alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6 - id: alspacdcs:e431119d-30b7-46c7-baff-991d30503ffd name: freeze_id_chr18 data_distributions: - id: alspacdcs:018f21f5-5f99-4b78-bd94-3dda5dae643b name: freeze_id_chr18.bed md5sum: 79a3cf5cccd16362910c770e40effa1a filesize: 148.8MB filetype: .bed belongs_to_container: alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6 - id: alspacdcs:35415f32-a6f8-47d6-bd17-4f9069e28d42 name: freeze_id_chr7 data_distributions: - id: alspacdcs:cd3b229f-8873-437f-aeb6-a9fc01f79950 name: freeze_id_chr7.bim md5sum: 54756145070ef0d80ba126d0c5ae8ea6 filesize: 3.8MB filetype: .bim belongs_to_container: alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6 - id: alspacdcs:75d323f3-bdf0-468b-9196-c1c55de89b58 name: freeze_id_chr15 data_distributions: - id: alspacdcs:2c39ea83-48c7-400a-85c9-5b83ec92f9fb name: freeze_id_chr15.log md5sum: 33ff3129151bfaa0d98d82e628d082e4 filesize: 992.0B filetype: .log belongs_to_container: alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6 - id: alspacdcs:c58f6371-6d98-4f2b-9ea8-1b4346d582ca name: freeze_id_chr21 data_distributions: - id: alspacdcs:86375ecf-b414-4392-8ab6-fcd6afbab8d9 name: freeze_id_chr21.bed md5sum: 9e6b8a4df4b7e9f0877f65d1d8de87b9 filesize: 65.6MB filetype: .bed belongs_to_container: alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6 - id: alspacdcs:606de011-edb1-43f1-baa1-54f24048479f name: freeze_id_chr4 data_distributions: - id: alspacdcs:7fd75972-cbea-4994-b21c-91dc7d5dffa0 name: freeze_id_chr4.bim md5sum: a2604d5a956bd4cbbdab49202aa69eea filesize: 4.3MB filetype: .bim belongs_to_container: alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6 - id: alspacdcs:dbf2ba9c-18c6-4097-8c05-583ebb47c7cb name: freeze_id_chr17 data_distributions: - id: alspacdcs:12864476-402f-4d29-9fb2-bdcb6f4942a7 name: freeze_id_chr17.log md5sum: cdfc7db390ca5dd8352e2a529bbd3e00 filesize: 992.0B filetype: .log belongs_to_container: alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6 - id: 
alspacdcs:5bc41cec-50db-4cf0-ba86-ce685812062d name: freeze_id_chr14 data_distributions: - id: alspacdcs:5567581e-b298-4cb9-8433-3ee9c104987f name: freeze_id_chr14.bed md5sum: 13d314d1b415f6ed09b498bbc722bfd0 filesize: 162.6MB filetype: .bed belongs_to_container: alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6 - id: alspacdcs:bda525de-9e7b-41d3-bca1-34f7788591ff name: freeze_id_chr3 data_distributions: - id: alspacdcs:bff82c67-4c01-447d-a41d-b05803940923 name: freeze_id_chr3.log md5sum: d0ea2d5c24cdf426e2c78123ac4574ea filesize: 988.0B filetype: .log belongs_to_container: alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6 - id: alspacdcs:272ae0a1-b6bb-4f2c-ba34-23119e2a5275 name: freeze_id_chr10 data_distributions: - id: alspacdcs:c952d88a-e7e3-4f85-a656-ca7977b67101 name: freeze_id_chr10.bim md5sum: e775a7afd2a2eed2608f2b65b2d829b9 filesize: 3.8MB filetype: .bim belongs_to_container: alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6 - id: alspacdcs:bc80fdc6-cbcb-4a67-a203-04db70d1d527 name: freeze_id_chr11 data_distributions: - id: alspacdcs:eeaf7bc7-e792-4e8f-957c-c0766a66a08c name: freeze_id_chr11.bim md5sum: 313af6755f961e49ae6de817c365d509 filesize: 3.5MB filetype: .bim belongs_to_container: alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6 - id: alspacdcs:c9bcb40a-20ce-4e47-90e7-18879d64f9b1 name: freeze_id_chr9 data_distributions: - id: alspacdcs:9b876769-7afe-465f-a0ae-ea11800e9be7 name: freeze_id_chr9.log md5sum: bde38daafb9615032be46b6968cd116b filesize: 988.0B filetype: .log belongs_to_container: alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6 - id: alspacdcs:d9df8cc9-910d-45a3-9db3-5d4d632de1f9 name: freeze_id_chr16 data_distributions: - id: alspacdcs:a0ed7465-65e5-4966-9625-7a3da114b8a0 name: freeze_id_chr16.bed md5sum: 450e93cf17c7306ca1bfc9a5b7d4bbcb filesize: 138.6MB filetype: .bed belongs_to_container: alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6 - id: alspacdcs:0cdc4c6f-8e65-481c-8d43-7da09480dd04 name: freeze_id_chr13 data_distributions: - id: 
alspacdcs:d49044e2-1053-4d53-baec-8ccca789f4cc name: freeze_id_chr13.bed md5sum: ce0b1edef716d842cc4dadcbea3fbe2f filesize: 201.7MB filetype: .bed belongs_to_container: alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6 - id: alspacdcs:5792fa9d-fb9b-4f40-9fd2-0aa8f1a1a452 name: freeze_id_chr16 data_distributions: - id: alspacdcs:6999289c-b783-4f0d-8e0b-6984bdd04787 name: freeze_id_chr16.bim md5sum: b16a743ffcb5b7914b17b20edc7b9d2a filesize: 1.9MB filetype: .bim belongs_to_container: alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6 - id: alspacdcs:c34459eb-d69f-4b62-a452-9d00ef6609f1 name: freeze_id_chr2 data_distributions: - id: alspacdcs:071dfc83-36ea-4796-9a23-d00d6a2e53ad name: freeze_id_chr2.bim md5sum: 3462f2499de7d0a050821ce398b54d1a filesize: 5.9MB filetype: .bim belongs_to_container: alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6 - id: alspacdcs:8cb108f2-8391-4fde-8b5e-83a89c0240a5 name: freeze_id_chr1 data_distributions: - id: alspacdcs:4c8483bd-e02e-4dbb-bcc7-0185a70bfb36 name: freeze_id_chr1.log md5sum: 6cc071f7b7ff5f12fbcf6eea6dd73c43 filesize: 988.0B filetype: .log belongs_to_container: alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6 - id: alspacdcs:c5aa8d14-feb6-4a7d-914e-7b0181635a90 name: freeze_id_chr22 data_distributions: - id: alspacdcs:4e2840e7-d233-40dc-bd87-29041a07e130 name: freeze_id_chr22.bed md5sum: 456cb4e948d412c835857055f441e1e8 filesize: 65.5MB filetype: .bed belongs_to_container: alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6 - id: alspacdcs:867c75d3-2575-4a54-ab64-651ff91f1521 name: freeze_id_chr17 data_distributions: - id: alspacdcs:db449126-ab5c-4458-a175-0784f5941f37 name: freeze_id_chr17.bim md5sum: fa35dae8d4fb5433460cc8d42deddb6e filesize: 1.6MB filetype: .bim belongs_to_container: alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6 - id: alspacdcs:3038426d-3f94-4957-bdc9-c96788c39157 name: freeze_id_chr18 data_distributions: - id: alspacdcs:98c73740-bc04-4814-a409-aba44758509f name: freeze_id_chr18.fam md5sum: 
67185256052f381f1dc8c8eb3c1b18d2 filesize: 277.6KB filetype: .fam belongs_to_container: alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6 - id: alspacdcs:d56d3675-a4bd-489a-a837-b080d8300fc2 name: freeze_id_chr14 data_distributions: - id: alspacdcs:707295b8-cd15-4685-b876-74a69e1cee23 name: freeze_id_chr14.log md5sum: c54920b54bd39996480f01e6121ac944 filesize: 992.0B filetype: .log belongs_to_container: alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6 - id: alspacdcs:bf515d14-b521-4570-96f2-da4d53c49c88 name: freeze_id_chr9 data_distributions: - id: alspacdcs:51848fd3-3315-41ce-a7ed-de89c26c1480 name: freeze_id_chr9.bim md5sum: 729b7f9b759d4b97fb32cad65887b498 filesize: 3.2MB filetype: .bim belongs_to_container: alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6 - id: alspacdcs:bd53b8a3-bdeb-4530-a855-5eb3802cea7d name: freeze_id_chr1 data_distributions: - id: alspacdcs:d42ce2ab-e264-4604-85e7-a4c9bdfa443d name: freeze_id_chr1.bim md5sum: 443f081673deb28df6fd408bf1accb50 filesize: 5.1MB filetype: .bim belongs_to_container: alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6 - id: alspacdcs:c7b4cd49-2719-4e34-a507-14d5b7f5b4b9 name: freeze_id_chr6 data_distributions: - id: alspacdcs:36b0f6b0-f6f5-4cdc-a2d6-0fdeb005b786 name: freeze_id_chr6.bim md5sum: 079964fd3600112a6241f47a0fa778ce filesize: 4.8MB filetype: .bim belongs_to_container: alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6 - id: alspacdcs:17f253d0-9d0e-4de7-91bd-236f169a32f2 name: freeze_id_chr7 data_distributions: - id: alspacdcs:437e339b-691d-4b22-a2ee-fbf7ed2fe32f name: freeze_id_chr7.log md5sum: 2b418ae4df815b3cb34cdd13fdb6b0fb filesize: 988.0B filetype: .log belongs_to_container: alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6 - id: alspacdcs:363ab6c5-2c5f-4e8b-aa8b-d20044a26e06 name: freeze_id_chr11 data_distributions: - id: alspacdcs:71d4d592-0282-4db0-9037-b23137f1db5c name: freeze_id_chr11.bed md5sum: 5736da3f7d53daaa4f74f9f0fd8baf40 filesize: 251.9MB filetype: .bed belongs_to_container: 
alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6 - id: alspacdcs:197f336f-d902-4bf1-ae63-d40308fa8e23 name: freeze_id_chr12 data_distributions: - id: alspacdcs:1a11a64b-c328-4ae8-8a4b-68ffcb30597e name: freeze_id_chr12.bim md5sum: 656b255438b8175b1739abf77e171526 filesize: 3.4MB filetype: .bim belongs_to_container: alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6 - id: alspacdcs:7f9a7384-e4eb-47e3-9e82-4857ed172383 name: freeze_id_chr7 data_distributions: - id: alspacdcs:634bc271-7760-4813-9633-aefc9207a047 name: freeze_id_chr7.bed md5sum: da1d1397bdb574aa88e32401065f80c5 filesize: 277.4MB filetype: .bed belongs_to_container: alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6 - id: alspacdcs:6b31c029-3d6b-4f87-953d-f90e8b6d6d09 name: freeze_id_chr22 data_distributions: - id: alspacdcs:a0afed4e-dbc6-41f6-bcb6-86462a0fa4e5 name: freeze_id_chr22.log md5sum: 6de55fc23e840f7c2d3e3f4b13296d6e filesize: 992.0B filetype: .log belongs_to_container: alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6 - id: alspacdcs:474b2704-2e9a-4ff8-b1a2-b60be16363fe name: freeze_id_chr3 data_distributions: - id: alspacdcs:a089aee0-b41d-4668-b22a-559631bb2402 name: freeze_id_chr3.fam md5sum: 67185256052f381f1dc8c8eb3c1b18d2 filesize: 277.6KB filetype: .fam belongs_to_container: alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6 - id: alspacdcs:aac0c218-a214-49d9-ba58-eb4311b62e98 name: freeze_id_chr16 data_distributions: - id: alspacdcs:3a59e242-c2d9-4844-8e0c-b38e226acaa5 name: freeze_id_chr16.log md5sum: 7146b2b9e7a3d5ab8a3e2c73d23e0f4a filesize: 992.0B filetype: .log belongs_to_container: alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6 - id: alspacdcs:c01959e8-02a4-494b-9cb2-a8efc4860007 name: freeze_id_chr9 data_distributions: - id: alspacdcs:2465cb98-ccea-42a1-bcc0-672e1fa9e767 name: freeze_id_chr9.fam md5sum: 67185256052f381f1dc8c8eb3c1b18d2 filesize: 277.6KB filetype: .fam belongs_to_container: alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6 - id: alspacdcs:8fdeeceb-a829-44a7-bec2-07d7ada330d3 name: 
freeze_id_chr3 data_distributions: - id: alspacdcs:6cf89d0d-b954-4ae2-bce7-7df2e833c392 name: freeze_id_chr3.bim md5sum: 3232f0e3bcf4f18841a75e96ca89dec5 filesize: 4.6MB filetype: .bim belongs_to_container: alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6 - id: alspacdcs:11752567-c3bb-41e2-84f5-1bfe4857290c name: freeze_id_chr10 data_distributions: - id: alspacdcs:4abd22b9-fedf-422b-80e8-ceed311a222b name: freeze_id_chr10.bed md5sum: 27c2dd1b9770772584ef2480c8a68e04 filesize: 268.1MB filetype: .bed belongs_to_container: alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6 - id: alspacdcs:2055447a-93cb-43db-ae2d-24e02e87daa3 name: freeze_id_chr8 data_distributions: - id: alspacdcs:45792020-f1bf-430d-bec7-0e6790e246f1 name: freeze_id_chr8.log md5sum: cb9f7054a4b8d577de31cf6b91ff11bf filesize: 988.0B filetype: .log belongs_to_container: alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6 - id: alspacdcs:51c93059-2dfd-4264-8721-b6df496f438c name: freeze_id_chr19 data_distributions: - id: alspacdcs:f9bae995-2699-4e0f-9d17-3c192c133759 name: freeze_id_chr19.fam md5sum: 67185256052f381f1dc8c8eb3c1b18d2 filesize: 277.6KB filetype: .fam belongs_to_container: alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6 - id: alspacdcs:38c6cb09-02f5-4fe0-abd6-4da5b4df0bce name: freeze_id_chr19 data_distributions: - id: alspacdcs:985aaf6f-c3bb-4fbd-a07b-e91f9a3dfbf0 name: freeze_id_chr19.log md5sum: 7a40cbff0a1c6db3fdba28f03a3d43a5 filesize: 992.0B filetype: .log belongs_to_container: alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6 - id: alspacdcs:5fc77dfb-4515-4a4e-aa68-25ef98407f4d name: freeze_id_chr15 data_distributions: - id: alspacdcs:42ba278c-bba0-4287-9ec3-cdb5e38e6464 name: freeze_id_chr15.bed md5sum: 8562fa62c12180b23c101cb361b773cf filesize: 140.0MB filetype: .bed belongs_to_container: alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6 - id: alspacdcs:b47e36f2-9137-44d5-a2d5-224d373a9186 name: freeze_id_chr19 data_distributions: - id: alspacdcs:98a3e062-3233-4c09-9520-9ae3f3df7c85 name: 
freeze_id_chr19.bed md5sum: 29af6d46019438d57c718cb36c71a2b2 filesize: 71.8MB filetype: .bed belongs_to_container: alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6 - id: alspacdcs:4678a09c-3488-4d3c-9f87-787cd2cdbc3d name: freeze_id_chr10 data_distributions: - id: alspacdcs:a88879b6-b183-469b-a554-21029bed8da5 name: freeze_id_chr10.fam md5sum: 67185256052f381f1dc8c8eb3c1b18d2 filesize: 277.6KB filetype: .fam belongs_to_container: alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6 - id: alspacdcs:7d789eff-f9cd-443a-886f-7a6b0ff2dd4f name: freeze_id_chr13 data_distributions: - id: alspacdcs:62277cf1-c229-45af-9675-d66c66ff61a3 name: freeze_id_chr13.log md5sum: ed83e0b365b8955f7e5045f9553b9b28 filesize: 994.0B filetype: .log belongs_to_container: alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6 - id: alspacdcs:94fcb145-3bba-4c7a-b115-1c3d245c5f14 name: freeze_id_chr22 data_distributions: - id: alspacdcs:3a4bfa2b-71d7-49e2-aa4f-a24863d61e1b name: freeze_id_chr22.bim md5sum: 86a1da3366ba87e62f561dc09f64f9ac filesize: 920.9KB filetype: .bim belongs_to_container: alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6 - id: alspacdcs:aa5a1f7c-8a60-4bd8-b12b-d01efc0134dd name: freeze_id_chr8 data_distributions: - id: alspacdcs:b5cb6681-aa68-4102-ae82-b568ed04e1e7 name: freeze_id_chr8.bim md5sum: 72ecdd13e1c56de521d86b684218172d filesize: 3.9MB filetype: .bim belongs_to_container: alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6 - id: alspacdcs:3f7c5f16-2b9a-467e-ba02-5ef0092c712b name: freeze_id_chr17 data_distributions: - id: alspacdcs:e7ab3b7a-0dcd-4778-a2da-7d52707b0dc9 name: freeze_id_chr17.bed md5sum: 8860b99a8dceae07e3254f4f882f64de filesize: 113.2MB filetype: .bed belongs_to_container: alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6 - id: alspacdcs:430cf27a-376d-4082-9d86-b60beda1e1e6 name: freeze_id_chr22 data_distributions: - id: alspacdcs:df8edbb4-dfb4-4eb8-b6d9-5aded6c7e6c3 name: freeze_id_chr22.fam md5sum: 67185256052f381f1dc8c8eb3c1b18d2 filesize: 277.6KB filetype: .fam 
belongs_to_container: alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6 - id: alspacdcs:1c4e7e0d-cae4-49c4-9b35-0bd22dc78d1b name: freeze_id_chr13 data_distributions: - id: alspacdcs:7a42734f-674d-418f-a225-3a90c0083679 name: freeze_id_chr13.bim md5sum: 59c47611e6fc92f7561366c3dcd6e3fd filesize: 2.8MB filetype: .bim belongs_to_container: alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6 - id: alspacdcs:5af6660f-84dc-4a4c-adf0-54763b36b0d2 name: freeze_id_chr21 data_distributions: - id: alspacdcs:4091dfef-1578-46c4-b5f5-070c3a9e8318 name: freeze_id_chr21.log md5sum: 4da56042ef5ef9f3be98805a49920350 filesize: 992.0B filetype: .log belongs_to_container: alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6 - id: alspacdcs:78192dcf-2817-4bf1-83e2-d23d4ee74e95 name: freeze_id_chr11 data_distributions: - id: alspacdcs:9b763e0b-c2f7-44a7-9edf-072506c43a7a name: freeze_id_chr11.log md5sum: d236624b3f13fe0404baf76b862b6b17 filesize: 994.0B filetype: .log belongs_to_container: alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6 - id: alspacdcs:d955659d-27ae-47a0-9268-d3770d313d39 name: freeze_id_chr8 data_distributions: - id: alspacdcs:13c01f7b-9d87-4b52-b5e7-4ec49464b41e name: freeze_id_chr8.bed md5sum: 8794a1ea76d763ff1742b65bc3f94289 filesize: 285.7MB filetype: .bed belongs_to_container: alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6 - id: alspacdcs:009c3430-ae51-43d7-93cf-ffe5dc0f4f25 name: freeze_id_chr5 data_distributions: - id: alspacdcs:9ced704b-db10-47fa-9462-2c48a79099c3 name: freeze_id_chr5.bed md5sum: c4ec69fa754d7687661d56055a891f15 filesize: 325.7MB filetype: .bed belongs_to_container: alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6 - id: alspacdcs:003b9ab6-afce-4f9d-8a65-35f502f5581a name: freeze_id_chr5 data_distributions: - id: alspacdcs:b38a50e6-344a-4b83-95a8-83be59e2fb99 name: freeze_id_chr5.fam md5sum: 67185256052f381f1dc8c8eb3c1b18d2 filesize: 277.6KB filetype: .fam belongs_to_container: alspacdcs:dabc73ca-1d45-40bb-b0ed-11fab248ddf6
4.4. Genome-wide - 1000G imputed - G0 partners (gi_1000g_g0p)
4.4.1. Description
This dataset contains genome-wide array data imputed to the 1000 Genomes reference panel for G0 partners, with some additional G0 mothers and G1 individuals. The data have been cleaned and flipped to the positive strand, are in b37 coordinates, and were imputed to 1000 Genomes phase 1 version 3.
4.4.2. Methodology
A total of 3,453 ALSPAC mothers and fathers were genotyped at 535,478 SNPs using the Illumina HumanCoreExome chip genotyping platform by the ALSPAC lab, and genotypes were called using GenomeStudio. The resulting raw genome-wide data were subjected to standard quality control methods using PLINK (v1.07). Individuals were excluded on the basis of gender mismatches (n = 80); minimal or excessive heterozygosity (n = 64); disproportionate levels of individual missingness (>5%, n = 60) and possible contamination (n = 3). Population stratification was assessed by multidimensional scaling analysis and principal component analysis, with comparison against 1000 Genomes phase 3 data; all individuals with non-European ancestry were removed (n = 266). Cryptic relatedness was measured as SNP relatedness in GCTA (relatedness > 0.1, n = 69 removed). SNPs with a call rate of < 95% or evidence for violations of Hardy-Weinberg equilibrium (P < 1E-7) and those which failed GenomeStudio quality control measures were removed (n = 21,298). 6,594 duplicate SNPs were also removed. This resulted in genotypes for 2,911 unrelated mothers and fathers at 507,586 SNPs.
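The exclusion counts quoted above can be tallied as a quick consistency check. The following is a minimal sketch in Python; the dictionary labels are illustrative and are not ALSPAC variable names, and the counts are copied from the paragraph above.

```python
# Sample-level exclusions reported in the methodology text (labels are illustrative).
sample_exclusions = {
    "gender mismatch": 80,
    "minimal or excessive heterozygosity": 64,
    "individual missingness > 5%": 60,
    "possible contamination": 3,
    "population stratification": 266,
    "cryptic relatedness": 69,
}

# SNP-level exclusions reported in the same paragraph.
snp_exclusions = {
    "call rate / HWE / GenomeStudio QC": 21_298,
    "duplicate SNPs": 6_594,
}

genotyped_samples = 3_453
genotyped_snps = 535_478

remaining_samples = genotyped_samples - sum(sample_exclusions.values())
remaining_snps = genotyped_snps - sum(snp_exclusions.values())

print(remaining_samples)  # 2911
print(remaining_snps)     # 507586
```

Both totals agree with the 2,911 individuals and 507,586 SNPs reported at the end of the paragraph.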
We phased the data for 3,074 samples that passed QC, a set that still included related subjects, using SHAPEIT v2.r837. We then removed 155,336 monomorphic SNPs, 1,033 markers not in 1000 Genomes, 11,842 A/T or G/C SNPs and 10 duplicate sites to give 337,732 SNPs on chromosomes 1-23. Of the 329,363 markers on chromosomes 1-22, 298,742 overlapped the reference genome. We imputed to 1000 Genomes phase 1 version 3 using the Michigan Imputation Server. We then identified 2,217 samples where the aln assigned historically by the lab matched the genetically assigned aln. We then removed 12 subjects who had withdrawn consent and 6 subjects genotyped in an earlier work package to give 2,201 subjects.
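A/T and G/C SNPs are commonly removed before imputation because their two alleles are complementary, so the strand cannot be resolved from the alleles alone. The sketch below shows one way to flag such SNPs from rows in PLINK .bim layout (chromosome, variant id, genetic distance, position, allele 1, allele 2). It is an illustration only, not the pipeline used to build this dataset; the function names and the variant ids in the example are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_strand_ambiguous(allele1: str, allele2: str) -> bool:
    """Return True for A/T or C/G SNPs, whose alleles are each other's complement."""
    a1, a2 = allele1.upper(), allele2.upper()
    return COMPLEMENT.get(a1) == a2


def split_bim_lines(lines):
    """Split .bim rows into (kept, ambiguous) lists of variant ids."""
    kept, ambiguous = [], []
    for line in lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or malformed rows
        variant_id, a1, a2 = fields[1], fields[4], fields[5]
        if is_strand_ambiguous(a1, a2):
            ambiguous.append(variant_id)
        else:
            kept.append(variant_id)
    return kept, ambiguous


# Hypothetical rows in .bim layout, for illustration only.
example = [
    "1 rs_example_1 0 1000 A G",
    "1 rs_example_2 0 2000 A T",
    "1 rs_example_3 0 3000 C G",
]
kept, ambiguous = split_bim_lines(example)
print(kept)       # ['rs_example_1']
print(ambiguous)  # ['rs_example_2', 'rs_example_3']
```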
4.4.3. Freeze Docs
# This yaml file is a description of a freeze of a released version of a named alspac dataset # It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema id: alspacdcs:gi_1000g_g0p_2016-11-22_f3 name: Genome-wide - 1000G imputed - G0 partners version 2016-11-22 freeze 3 description: >- This dataset is the third freeze of the 2016-11-22 version of the genome-wide array data imputed to the 1000 genomes reference panel for G0 partners, with some additional G0 mothers and G1 individuals. freeze_size: 44G linker_file_md5sum: b528acad88cd1697129a7cd59aa14ada woc_file_md5sum: cf9249c306e766a8689f78197e1f5f25 all_individuals_to_exclude_md5sum: 7faad74aeebaba4ed71aac783414d75b git_tag: https://github.com/alspac/dataset_gi_1000g_g0p/releases/tag/freeze3 is_current_freeze: true freeze_number: 3 freeze_date: 2023-09-13 previous_freeze: alspacdcs:gi_1000g_g0p_2016-11-22_f2 next_freeze: freeze_of_alspac_dataset_version: alspacdcs:gi_1000g_g0p_2016-11-22 freeze_of_named_alspac_dataset: alspacdcs:gi_1000g_g0p has_containers: - id: alspacdcs:70b53764-4ed1-4e46-9188-a38d356279dc name: data description: A dir/folder containing the data bgen and sample files has_parts: - id: alspacdcs:gi_1000g_g0p_2016-11-22_sample_f3 name: Samples description: >- The samples in the data. To be used with the genetic data. data_distributions: - id: alspacdcs:ec597142ca1f0ace0a33f476c7bf68eb_swapped.sample name: swapped.sample description: >- A plain text .sample file. See https://doi.org/10.1101/308296 for file format details. md5sum: ec597142ca1f0ace0a33f476c7bf68eb filesize: 165k filetype: .sample number_of_participants: 2198 belongs_to_container: alspacdcs:70b53764-4ed1-4e46-9188-a38d356279dc - id: alspacdcs:gi_1000g_g0p_2016-11-22_chr1_f3 name: Chr1 description: Data for Chr1 data_distributions: - id: alspacdcs:a5eb049e4df5a8b005ae51b47947d830_filtered_data_chr01.bgen name: filtered_data_chr01.bgen description: >- An Oxford Bgen file for Chr1. To be used with file. 
See https://doi.org/10.1101/308296 for file format details. (bgen v1.2) md5sum: a5eb049e4df5a8b005ae51b47947d830 filesize: 3.4G filetype: .bgen number_of_participants: 2198 number_of_variants: 2159337 belongs_to_container: alspacdcs:70b53764-4ed1-4e46-9188-a38d356279dc - id: alspacdcs:gi_1000g_g0p_2016-11-22_chr2_f3 name: Chr2 description: Data for Chr2 data_distributions: - id: alspacdcs:e297c8d30455053d23ac360bcc886bb0_filtered_data_chr02.bgen name: filtered_data_chr02.bgen description: >- An Oxford Bgen file for Chr2. To be used with file. See https://doi.org/10.1101/308296 for file format details. (bgen v1.2) md5sum: e297c8d30455053d23ac360bcc886bb0 filesize: 3.6G filetype: .bgen number_of_participants: 2198 number_of_variants: 2349883 belongs_to_container: alspacdcs:70b53764-4ed1-4e46-9188-a38d356279dc - id: alspacdcs:gi_1000g_g0p_2016-11-22_chr3_f3 name: Chr3 description: Data for Chr3 data_distributions: - id: alspacdcs:c0b55e9d65c219ffb1b8c58a0ebb7c18_filtered_data_chr03.bgen name: filtered_data_chr03.bgen description: >- An Oxford Bgen file for Chr3. To be used with file. See https://doi.org/10.1101/308296 for file format details. (bgen v1.2) md5sum: c0b55e9d65c219ffb1b8c58a0ebb7c18 filesize: 3.0G filetype: .bgen number_of_participants: 2198 number_of_variants: 1969275 belongs_to_container: alspacdcs:70b53764-4ed1-4e46-9188-a38d356279dc - id: alspacdcs:gi_1000g_g0p_2016-11-22_chr4_f3 name: Chr4 description: Data for Chr4 data_distributions: - id: alspacdcs:514f09f02c74fc3eca83379e9e99c5dc_filtered_data_chr04.bgen name: filtered_data_chr04.bgen description: >- An Oxford Bgen file for Chr4. To be used with file. See https://doi.org/10.1101/308296 for file format details. 
(bgen v1.2) md5sum: 514f09f02c74fc3eca83379e9e99c5dc filesize: 3.1G filetype: .bgen number_of_participants: 2198 number_of_variants: 1969883 - id: alspacdcs:gi_1000g_g0p_2016-11-22_chr5_f3 name: Chr5 description: Data for Chr5 data_distributions: - id: alspacdcs:f4accbf5bdd6a2ccc9598e9e2221915d_filtered_data_chr05.bgen name: filtered_data_chr05.bgen description: >- An Oxford Bgen file for Chr5. To be used with file. See https://doi.org/10.1101/308296 for file format details. (bgen v1.2) md5sum: f4accbf5bdd6a2ccc9598e9e2221915d filesize: 2.8G filetype: .bgen number_of_participants: 2198 number_of_variants: 1809961 belongs_to_container: alspacdcs:70b53764-4ed1-4e46-9188-a38d356279dc - id: alspacdcs:gi_1000g_g0p_2016-11-22_chr6_f3 name: Chr6 description: Data for Chr6 data_distributions: - id: alspacdcs:a9327ad1591fdf7d349b066544e71c3a_filtered_data_chr06.bgen name: filtered_data_chr06.bgen description: >- An Oxford Bgen file for Chr6. To be used with file. See https://doi.org/10.1101/308296 for file format details. (bgen v1.2) md5sum: a9327ad1591fdf7d349b066544e71c3a filesize: 2.6G filetype: .bgen number_of_participants: 2198 number_of_variants: 1758025 belongs_to_container: alspacdcs:70b53764-4ed1-4e46-9188-a38d356279dc - id: alspacdcs:gi_1000g_g0p_2016-11-22_chr7_f3 name: Chr7 description: Data for Chr7 data_distributions: - id: alspacdcs:f832922558eddcf3feed87091c2ec0ae_filtered_data_chr07.bgen name: filtered_data_chr07.bgen description: >- An Oxford Bgen file for Chr7. To be used with file. See https://doi.org/10.1101/308296 for file format details. 
(bgen v1.2) md5sum: f832922558eddcf3feed87091c2ec0ae filesize: 2.7G filetype: .bgen number_of_participants: 2198 number_of_variants: 1601293 belongs_to_container: alspacdcs:70b53764-4ed1-4e46-9188-a38d356279dc - id: alspacdcs:gi_1000g_g0p_2016-11-22_chr8_f3 name: Chr8 description: Data for Chr8 data_distributions: - id: alspacdcs:47d79712e676a0048f90858cbb888179_filtered_data_chr08.bgen name: filtered_data_chr08.bgen description: >- An Oxford Bgen file for Chr8. To be used with file. See https://doi.org/10.1101/308296 for file format details. (bgen v1.2) md5sum: 47d79712e676a0048f90858cbb888179 filesize: 2.4G filetype: .bgen number_of_participants: 2198 number_of_variants: 1558902 belongs_to_container: alspacdcs:70b53764-4ed1-4e46-9188-a38d356279dc - id: alspacdcs:gi_1000g_g0p_2016-11-22_chr9_f3 name: Chr9 description: Data for Chr9 data_distributions: - id: alspacdcs:82a480f3e8792db2c1cec3adc50e1357_filtered_data_chr09.bgen name: filtered_data_chr09.bgen description: >- An Oxford Bgen file for Chr9. To be used with file. See https://doi.org/10.1101/308296 for file format details. (bgen v1.2) md5sum: 82a480f3e8792db2c1cec3adc50e1357 filesize: 1.9G filetype: .bgen number_of_participants: 2198 number_of_variants: 1189463 belongs_to_container: alspacdcs:70b53764-4ed1-4e46-9188-a38d356279dc - id: alspacdcs:gi_1000g_g0p_2016-11-22_chr10_f3 name: Chr10 description: Data for Chr10 data_distributions: - id: alspacdcs:8f64fe184e4c876a345a728ed5eeddcf_filtered_data_chr10.bgen name: filtered_data_chr10.bgen description: >- An Oxford Bgen file for Chr10. To be used with file. See https://doi.org/10.1101/308296 for file format details. 
(bgen v1.2) md5sum: 8f64fe184e4c876a345a728ed5eeddcf filesize: 2.2G filetype: .bgen number_of_participants: 2198 number_of_variants: 1363104 belongs_to_container: alspacdcs:70b53764-4ed1-4e46-9188-a38d356279dc - id: alspacdcs:gi_1000g_g0p_2016-11-22_chr11_f3 name: Chr11 description: Data for Chr11 data_distributions: - id: alspacdcs:b1b7e3bef0fe72cd90bd0ba456f687aa_filtered_data_chr11.bgen name: filtered_data_chr11.bgen description: >- An Oxford Bgen file for Chr11. To be used with file. See https://doi.org/10.1101/308296 for file format details. (bgen v1.2) md5sum: b1b7e3bef0fe72cd90bd0ba456f687aa filesize: 2.2G filetype: .bgen number_of_participants: 2198 number_of_variants: 1359640 belongs_to_container: alspacdcs:70b53764-4ed1-4e46-9188-a38d356279dc - id: alspacdcs:gi_1000g_g0p_2016-11-22_chr12_f3 name: Chr12 description: Data for Chr12 data_distributions: - id: alspacdcs:509202db22200fe0bd58210ab8e9c757_filtered_data_chr12.bgen name: filtered_data_chr12.bgen description: >- An Oxford Bgen file for Chr12. To be used with file. See https://doi.org/10.1101/308296 for file format details. (bgen v1.2) md5sum: 509202db22200fe0bd58210ab8e9c757 filesize: 2.1G filetype: .bgen number_of_participants: 2198 number_of_variants: 1316510 belongs_to_container: alspacdcs:70b53764-4ed1-4e46-9188-a38d356279dc - id: alspacdcs:gi_1000g_g0p_2016-11-22_chr13_f3 name: Chr13 description: Data for Chr13 data_distributions: - id: alspacdcs:176a10d38ab80783a8e392e5791edea7_filtered_data_chr13.bgen name: filtered_data_chr13.bgen description: >- An Oxford Bgen file for Chr13. To be used with file. See https://doi.org/10.1101/308296 for file format details. 
(bgen v1.2) md5sum: 176a10d38ab80783a8e392e5791edea7 filesize: 1.6G filetype: .bgen number_of_participants: 2198 number_of_variants: 988473 - id: alspacdcs:gi_1000g_g0p_2016-11-22_chr14_f3 name: Chr14 description: Data for Chr14 data_distributions: - id: alspacdcs:1ecd96aab2925bafd7d20497d85dd937_filtered_data_chr14.bgen name: filtered_data_chr14.bgen description: >- An Oxford Bgen file for Chr14. To be used with file. See https://doi.org/10.1101/308296 for file format details. (bgen v1.2) md5sum: 1ecd96aab2925bafd7d20497d85dd937 filesize: 1.5G filetype: .bgen number_of_participants: 2198 number_of_variants: 903811 belongs_to_container: alspacdcs:70b53764-4ed1-4e46-9188-a38d356279dc - id: alspacdcs:gi_1000g_g0p_2016-11-22_chr15_f3 name: Chr15 description: Data for Chr15 data_distributions: - id: alspacdcs:f8c5b54206189808e9a361cc0da63798_filtered_data_chr15.bgen name: filtered_data_chr15.bgen description: >- An Oxford Bgen file for Chr15. To be used with file. See https://doi.org/10.1101/308296 for file format details. (bgen v1.2) md5sum: f8c5b54206189808e9a361cc0da63798 filesize: 1.4G filetype: .bgen number_of_participants: 2198 number_of_variants: 814028 belongs_to_container: alspacdcs:70b53764-4ed1-4e46-9188-a38d356279dc - id: alspacdcs:gi_1000g_g0p_2016-11-22_chr16_f3 name: Chr16 description: Data for Chr16 data_distributions: - id: alspacdcs:52f065575d3cb2dff34df6763a583766_filtered_data_chr16.bgen name: filtered_data_chr16.bgen description: >- An Oxford Bgen file for Chr16. To be used with file. See https://doi.org/10.1101/308296 for file format details. 
(bgen v1.2) md5sum: 52f065575d3cb2dff34df6763a583766 filesize: 1.6G filetype: .bgen number_of_participants: 2198 number_of_variants: 867901 belongs_to_container: alspacdcs:70b53764-4ed1-4e46-9188-a38d356279dc - id: alspacdcs:gi_1000g_g0p_2016-11-22_chr17_f3 name: Chr17 description: Data for Chr17 data_distributions: - id: alspacdcs:73d85caf67dcedc63b11a43bd5ccb44d_filtered_data_chr17.bgen name: filtered_data_chr17.bgen description: >- An Oxford Bgen file for Chr17. To be used with file. See https://doi.org/10.1101/308296 for file format details. (bgen v1.2) md5sum: 73d85caf67dcedc63b11a43bd5ccb44d filesize: 1.4G filetype: .bgen number_of_participants: 2198 number_of_variants: 755467 belongs_to_container: alspacdcs:70b53764-4ed1-4e46-9188-a38d356279dc - id: alspacdcs:gi_1000g_g0p_2016-11-22_chr18_f3 name: Chr18 description: Data for Chr18 data_distributions: - id: alspacdcs:b8e055a6c0955bb67161c9f7a1d8cad7_filtered_data_chr18.bgen name: filtered_data_chr18.bgen description: >- An Oxford Bgen file for Chr18. To be used with file. See https://doi.org/10.1101/308296 for file format details. (bgen v1.2) md5sum: b8e055a6c0955bb67161c9f7a1d8cad7 filesize: 1.4G filetype: .bgen number_of_participants: 2198 number_of_variants: 783661 belongs_to_container: alspacdcs:70b53764-4ed1-4e46-9188-a38d356279dc - id: alspacdcs:gi_1000g_g0p_2016-11-22_chr19_f3 name: Chr19 description: Data for Chr19 data_distributions: - id: alspacdcs:37ea045cd9f4027cba547b7b89c3a1a0_filtered_data_chr19.bgen name: filtered_data_chr19.bgen description: >- An Oxford Bgen file for Chr19. To be used with file. See https://doi.org/10.1101/308296 for file format details. 
(bgen v1.2) md5sum: 37ea045cd9f4027cba547b7b89c3a1a0 filesize: 1.3G filetype: .bgen number_of_participants: 2198 number_of_variants: 606147 belongs_to_container: alspacdcs:70b53764-4ed1-4e46-9188-a38d356279dc - id: alspacdcs:gi_1000g_g0p_2016-11-22_chr20_f3 name: Chr20 description: Data for Chr20 data_distributions: - id: alspacdcs:d241eb21be3188c26c460e1f65f0d8c1_filtered_data_chr20.bgen name: filtered_data_chr20.bgen description: >- An Oxford Bgen file for Chr20. To be used with file. See https://doi.org/10.1101/308296 for file format details. (bgen v1.2) md5sum: d241eb21be3188c26c460e1f65f0d8c1 filesize: 1.1G filetype: .bgen number_of_participants: 2198 number_of_variants: 618749 belongs_to_container: alspacdcs:70b53764-4ed1-4e46-9188-a38d356279dc - id: alspacdcs:gi_1000g_g0p_2016-11-22_chr21_f3 name: Chr21 description: Data for Chr21 data_distributions: - id: alspacdcs:7881bdc24e7f0adbfb800b49d1efd590_filtered_data_chr21.bgen name: filtered_data_chr21.bgen description: >- An Oxford Bgen file for Chr21. To be used with file. See https://doi.org/10.1101/308296 for file format details. (bgen v1.2) md5sum: 7881bdc24e7f0adbfb800b49d1efd590 filesize: 672M filetype: .bgen number_of_participants: 2198 number_of_variants: 378064 belongs_to_container: alspacdcs:70b53764-4ed1-4e46-9188-a38d356279dc - id: alspacdcs:gi_1000g_g0p_2016-11-22_chr22_f3 name: Chr22 description: Data for Chr22 data_distributions: - id: alspacdcs:824412e963441699f260c6245f65659d_filtered_data_chr22.bgen name: filtered_data_chr22.bgen description: >- An Oxford Bgen file for Chr22. To be used with file. See https://doi.org/10.1101/308296 for file format details. (bgen v1.2) md5sum: 824412e963441699f260c6245f65659d filesize: 722M filetype: .bgen number_of_participants: 2198 number_of_variants: 366590 belongs_to_container: alspacdcs:70b53764-4ed1-4e46-9188-a38d356279dc
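Each file in the freeze description above is listed with an md5sum, which can be used to confirm that a transferred copy is intact. The following is a minimal sketch in Python using the standard hashlib module; the directory name "data" and the helper names are assumptions made for illustration, and only two of the listed checksums are copied in.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_files(data_dir, expected):
    """Return {filename: status}, where status is 'ok', 'mismatch' or 'missing'."""
    results = {}
    for name, md5sum in expected.items():
        path = Path(data_dir) / name
        if not path.is_file():
            results[name] = "missing"
        elif md5_of_file(path) == md5sum:
            results[name] = "ok"
        else:
            results[name] = "mismatch"
    return results


# Two entries copied from the freeze description above; extend as needed.
expected_md5 = {
    "filtered_data_chr21.bgen": "7881bdc24e7f0adbfb800b49d1efd590",
    "filtered_data_chr22.bgen": "824412e963441699f260c6245f65659d",
}

# "data" is a hypothetical local path to the freeze's data container.
for name, status in verify_files("data", expected_md5).items():
    print(name, status)
```

Reading in chunks keeps memory use flat, which matters for the multi-gigabyte .bgen files.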
4.5. Genome-wide - 1000G imputed - G0 mothers + G1 (gi_1000g_g0m_g1)
4.5.1. Description
This dataset contains genome-wide 1000G imputed data for G0 mothers + G1. The data have been cleaned and flipped to the positive strand, are in b37 coordinates, and were imputed to 1000 Genomes phase 1 version 3.
4.5.2. Methodology
ALSPAC children were genotyped using the Illumina HumanHap550 quad chip genotyping platforms by 23andMe, subcontracting the Wellcome Trust Sanger Institute, Cambridge, UK and the Laboratory Corporation of America, Burlington, NC, US. The resulting raw genome-wide data were subjected to standard quality control methods. Individuals were excluded on the basis of gender mismatches; minimal or excessive heterozygosity; disproportionate levels of individual missingness (>3%) and insufficient sample replication (IBD < 0.8). Population stratification was assessed by multidimensional scaling analysis and compared with HapMap II (release 22) European descent (CEU), Han Chinese, Japanese and Yoruba reference populations; all individuals with non-European ancestry were removed. SNPs with a minor allele frequency of < 1%, a call rate of < 95% or evidence for violations of Hardy-Weinberg equilibrium (P < 5E-7) were removed. Cryptic relatedness was measured as proportion of identity by descent (IBD > 0.1). Related subjects that passed all other quality control thresholds were retained during subsequent phasing and imputation. 9,115 subjects and 500,527 SNPs passed these quality control filters.
ALSPAC mothers were genotyped using the Illumina Human660W-Quad array at Centre National de Génotypage (CNG) and genotypes were called with Illumina GenomeStudio. PLINK (v1.07) was used to carry out quality control measures on an initial set of 10,015 subjects and 557,124 directly genotyped SNPs. SNPs were removed if they displayed more than 5% missingness or a Hardy-Weinberg equilibrium P value of less than 1.0e-06. Additionally, SNPs with a minor allele frequency of less than 1% were removed. Samples were excluded if they displayed more than 5% missingness, had indeterminate X chromosome heterozygosity or extreme autosomal heterozygosity. Samples showing evidence of population stratification were identified by multidimensional scaling of genome-wide identity by state pairwise distances using the four HapMap populations as a reference, and then excluded. Cryptic relatedness was assessed using an IBD estimate of more than 0.125, which is expected to correspond to roughly 12.5% of alleles shared IBD, or relatedness at the first-cousin level. Related subjects that passed all other quality control thresholds were retained during subsequent phasing and imputation. 9,048 subjects and 526,688 SNPs passed these quality control filters.
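The SNP-level thresholds quoted in the two paragraphs above differ only in the Hardy-Weinberg cut-off (P < 5E-7 for children, P < 1.0e-06 for mothers). The sketch below expresses them as a simple filter in Python. It illustrates the thresholds only and is not the PLINK workflow that was actually run; the class name, function name and example SNP statistics are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SnpThresholds:
    min_maf: float        # SNPs with a lower minor allele frequency are removed
    min_call_rate: float  # SNPs with a lower call rate are removed
    min_hwe_p: float      # SNPs with a lower HWE P value are removed


# Thresholds as quoted in the text (more than 5% missingness = call rate below 95%).
CHILDREN = SnpThresholds(min_maf=0.01, min_call_rate=0.95, min_hwe_p=5e-7)
MOTHERS = SnpThresholds(min_maf=0.01, min_call_rate=0.95, min_hwe_p=1e-6)


def passes_qc(maf, call_rate, hwe_p, thresholds):
    """Return True if a SNP survives all three SNP-level filters."""
    return (
        maf >= thresholds.min_maf
        and call_rate >= thresholds.min_call_rate
        and hwe_p >= thresholds.min_hwe_p
    )


# Hypothetical SNP statistics chosen to fall between the two HWE cut-offs.
example = {"maf": 0.12, "call_rate": 0.99, "hwe_p": 8e-7}
print(passes_qc(**example, thresholds=CHILDREN))  # True
print(passes_qc(**example, thresholds=MOTHERS))   # False
```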
We combined 477,482 SNP genotypes in common between the sample of mothers and the sample of children. We removed SNPs with genotype missingness above 1% due to poor quality (11,396 SNPs removed) and removed a further 321 subjects due to potential ID mismatches. This resulted in a dataset of 17,842 subjects containing 6,305 duos and 465,740 SNPs (112 were removed during liftover and 234 were out of HWE after combination). We estimated haplotypes using ShapeIT (v2.r644), which utilises relatedness during phasing. We obtained a phased version of the 1000 Genomes reference panel (Phase 1, Version 3) from the Impute2 reference data repository (phased using ShapeIT v2.r644, haplotype release date Dec 2013). Imputation of the target data was performed using Impute V2.2.2 against the reference panel (all polymorphic SNPs excluding singletons), using all 2,186 reference haplotypes (including non-Europeans).
This gave 8,237 eligible children and 8,196 eligible mothers with available genotype data after exclusion of related subjects using the cryptic relatedness measures described previously.
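The subject and SNP counts quoted in this methodology are mutually consistent, which can be checked with a few lines of arithmetic. The sketch below assumes that the 321 subjects were removed from the pooled set of QC-passing mothers and children, which the text implies but does not state outright; the variable names are illustrative.

```python
# Subjects passing array-level QC, as quoted in the text.
children_passing_qc = 9_115
mothers_passing_qc = 9_048
removed_id_mismatch = 321

combined_subjects = children_passing_qc + mothers_passing_qc - removed_id_mismatch

# SNPs shared between the two arrays, then the three removal steps.
snps_in_common = 477_482
removed_missingness = 11_396
removed_liftover = 112
removed_hwe = 234

combined_snps = snps_in_common - removed_missingness - removed_liftover - removed_hwe

print(combined_subjects)  # 17842
print(combined_snps)      # 465740
```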
4.5.3. Freeze Docs
# This yaml file is a description of a freeze of a released version of a named alspac dataset # It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema id: alspacdcs:gi_1000g_g0m_g1_2015-10-30_f3 name: >- Genome-wide - 1000G imputed - G0 mothers + G1 version 2015-10-30 freeze 3 description: >- This is the third freeze of the 2015-10-30 version of the gi_1000g_g0m_g1 dataset. It contains data in the Oxford format, which is a combination of bgen and sample (version 1.2) files. It is a subset of the data in gi_1000g_g0m_g1_2015-10-30 limited to one format and with participants who have withdrawn their consent removed. The Dec 2013 haplotype release of 1000 genomes phase 1 version 3 has 199 reported SNPs with an incorrect strand. The strand issues are present in this imputation version. For more information and the origins of this list please visit: https://mathgen.stats.ox.ac.uk/impute/data_download_1000G_phase1_integrated_SHAPEIT2_16-06-14.html They are very unlikely to have systematic effects across the genome and are most probably isolated to these 199 known problematic SNPs. The user is advised to discard them from their analysis. This will be addressed in the next imputation release. freeze_size: 122G linker_file_md5sum: b528acad88cd1697129a7cd59aa14ada woc_file_md5sum: cf9249c306e766a8689f78197e1f5f25 all_individuals_to_exclude_md5sum: 7faad74aeebaba4ed71aac783414d75b git_tag: https://github.com/alspac/dataset_gi_1000g_g0m_g1/releases/tag/freeze3 is_current_freeze: true freeze_number: 3 freeze_date: 2023-09-13 previous_freeze: alspacdcs:gi_1000g_g0m_g1_2015-10-30_f2 freeze_of_alspac_dataset_version: alspacdcs:gi_1000g_g0m_g1_2015-10-30 freeze_of_named_alspac_dataset: alspacdcs:gi_1000g_g0m_g1 has_parts: - id: alspacdcs:gi_1000g_g0m_g1_2015-10-30_sample_f3 name: Samples description: >- The samples in the data. To be used with the genetic data. 
data_distributions: - id: alspacdcs:86398f756a748b40e51d0b02ad86ce5b_swapped.sample name: swapped.sample description: >- A plain text .sample file. See https://doi.org/10.1101/308296 for file format details. md5sum: 86398f756a748b40e51d0b02ad86ce5b filesize: 1.2M filetype: .sample number_of_participants: 17450 - id: alspacdcs:gi_1000g_g0m_g1_2015-10-30_chr1_f3 name: Chr1 description: Data for Chr1 data_distributions: - id: alspacdcs:d4386fe4fcbfd1464fec97335693bb47_filtered_01.bgen name: filtered_01.bgen description: >- An Oxford Bgen file for Chr1. To be used with alspacdcs:86398f756a748b40e51d0b02ad86ce5b_swapped.sample file. See https://doi.org/10.1101/308296 for file format details. (bgen v1.2) md5sum: d4386fe4fcbfd1464fec97335693bb47 filesize: 9.1G filetype: .bgen number_of_participants: 17450 number_of_variants: 2155158 - id: alspacdcs:gi_1000g_g0m_g1_2015-10-30_chr2_f3 name: Chr2 description: Data for Chr2 data_distributions: - id: alspacdcs:a021b75c0bc519ed48c3342d428d988d_filtered_02.bgen name: filtered_02.bgen description: >- An Oxford Bgen file for Chr2. To be used with alspacdcs:86398f756a748b40e51d0b02ad86ce5b_swapped.sample file. See https://doi.org/10.1101/308296 for file format details. (bgen v1.2) md5sum: a021b75c0bc519ed48c3342d428d988d filesize: 9.1G filetype: .bgen number_of_participants: 17450 number_of_variants: 2346862 - id: alspacdcs:gi_1000g_g0m_g1_2015-10-30_chr3_f3 name: Chr3 description: Data for Chr3 data_distributions: - id: alspacdcs:bc61d427013f6a143209714af43fd3a7_filtered_03.bgen name: filtered_03.bgen description: >- An Oxford Bgen file for Chr3. To be used with alspacdcs:86398f756a748b40e51d0b02ad86ce5b_swapped.sample file. See https://doi.org/10.1101/308296 for file format details. 
(bgen v1.2) md5sum: bc61d427013f6a143209714af43fd3a7 filesize: 7.7G filetype: .bgen number_of_participants: 17450 number_of_variants: 1966662 - id: alspacdcs:gi_1000g_g0m_g1_2015-10-30_chr4_f3 name: Chr4 description: Data for Chr4 data_distributions: - id: alspacdcs:9616c502415e3aefd3cec770201a1db9_filtered_04.bgen name: filtered_04.bgen description: >- An Oxford Bgen file for Chr4. To be used with alspacdcs:86398f756a748b40e51d0b02ad86ce5b_swapped.sample file. See https://doi.org/10.1101/308296 for file format details. (bgen v1.2) md5sum: 9616c502415e3aefd3cec770201a1db9 filesize: 8.4G filetype: .bgen number_of_participants: 17450 number_of_variants: 1968171 - id: alspacdcs:gi_1000g_g0m_g1_2015-10-30_chr5_f3 name: Chr5 description: Data for Chr5 data_distributions: - id: alspacdcs:f7146ed5bfdcc4d6399bbef64809d7a6_filtered_05.bgen name: filtered_05.bgen description: >- An Oxford Bgen file for Chr5. To be used with alspacdcs:86398f756a748b40e51d0b02ad86ce5b_swapped.sample file. See https://doi.org/10.1101/308296 for file format details. (bgen v1.2) md5sum: f7146ed5bfdcc4d6399bbef64809d7a6 filesize: 6.9G filetype: .bgen number_of_participants: 17450 number_of_variants: 1808090 - id: alspacdcs:gi_1000g_g0m_g1_2015-10-30_chr6_f3 name: Chr6 description: Data for Chr6 data_distributions: - id: alspacdcs:3834f0465729fed20bcf89d7f27a7ef6_filtered_06.bgen name: filtered_06.bgen description: >- An Oxford Bgen file for Chr6. To be used with alspacdcs:86398f756a748b40e51d0b02ad86ce5b_swapped.sample file. See https://doi.org/10.1101/308296 for file format details. (bgen v1.2) md5sum: 3834f0465729fed20bcf89d7f27a7ef6 filesize: 6.8G filetype: .bgen number_of_participants: 17450 number_of_variants: 1755859 - id: alspacdcs:gi_1000g_g0m_g1_2015-10-30_chr7_f3 name: Chr7 description: Data for Chr7 data_distributions: - id: alspacdcs:5de9ed5dc646de7a7a5c9ca503d1212e_filtered_08.bgen name: filtered_07.bgen description: >- An Oxford Bgen file for Chr7. 
To be used with alspacdcs:86398f756a748b40e51d0b02ad86ce5b_swapped.sample file. See https://doi.org/10.1101/308296 for file format details. (bgen v1.2) md5sum: 5de9ed5dc646de7a7a5c9ca503d1212e filesize: 7.1G filetype: .bgen number_of_participants: 17450 number_of_variants: 1599387 - id: alspacdcs:gi_1000g_g0m_g1_2015-10-30_chr8_f3 name: Chr8 description: Data for Chr8 data_distributions: - id: alspacdcs:e78c84b883bc8fe52f0c33598cc815a3_filtered_08.bgen name: filtered_08.bgen description: >- An Oxford Bgen file for Chr8. To be used with alspacdcs:86398f756a748b40e51d0b02ad86ce5b_swapped.sample file. See https://doi.org/10.1101/308296 for file format details. (bgen v1.2) md5sum: e78c84b883bc8fe52f0c33598cc815a3 filesize: 5.9G filetype: .bgen number_of_participants: 17450 number_of_variants: 1557429 - id: alspacdcs:gi_1000g_g0m_g1_2015-10-30_chr9_f3 name: Chr9 description: Data for Chr9 data_distributions: - id: alspacdcs:9948344bfdebdcd38a2b09224f1af23d_filtered_09.bgen name: filtered_09.bgen description: >- An Oxford Bgen file for Chr9. To be used with alspacdcs:86398f756a748b40e51d0b02ad86ce5b_swapped.sample file. See https://doi.org/10.1101/308296 for file format details. (bgen v1.2) md5sum: 9948344bfdebdcd38a2b09224f1af23d filesize: 5.1G filetype: .bgen number_of_participants: 17450 number_of_variants: 1187731 - id: alspacdcs:gi_1000g_g0m_g1_2015-10-30_chr10_f3 name: Chr10 description: Data for Chr10 data_distributions: - id: alspacdcs:1775551d5bac7b13d0e884b2015ba421_filtered_10.bgen name: filtered_10.bgen description: >- An Oxford Bgen file for Chr10. To be used with alspacdcs:86398f756a748b40e51d0b02ad86ce5b_swapped.sample file. See https://doi.org/10.1101/308296 for file format details. 
(bgen v1.2) md5sum: 1775551d5bac7b13d0e884b2015ba421 filesize: 5.4G filetype: .bgen number_of_participants: 17450 number_of_variants: 1361506 - id: alspacdcs:gi_1000g_g0m_g1_2015-10-30_chr11_f3 name: Chr11 description: Data for Chr11 data_distributions: - id: alspacdcs:99685738aff1b79b3028428983bed3f2_filtered_11.bgen name: filtered_11.bgen description: >- An Oxford Bgen file for Chr11. To be used with alspacdcs:86398f756a748b40e51d0b02ad86ce5b_swapped.sample file. See https://doi.org/10.1101/308296 for file format details. (bgen v1.2) md5sum: 99685738aff1b79b3028428983bed3f2 filesize: 5.4G filetype: .bgen number_of_participants: 17450 number_of_variants: 1356882 - id: alspacdcs:gi_1000g_g0m_g1_2015-10-30_chr12_f3 name: Chr12 description: Data for Chr12 data_distributions: - id: alspacdcs:c08cd053752044364b342e9873dedaea_filtered_12.bgen name: filtered_12.bgen description: >- An Oxford Bgen file for Chr12. To be used with alspacdcs:86398f756a748b40e51d0b02ad86ce5b_swapped.sample file. See https://doi.org/10.1101/308296 for file format details. (bgen v1.2) md5sum: c08cd053752044364b342e9873dedaea filesize: 5.4G filetype: .bgen number_of_participants: 17450 number_of_variants: 1314328 - id: alspacdcs:gi_1000g_g0m_g1_2015-10-30_chr13_f3 name: Chr13 description: Data for Chr13 data_distributions: - id: alspacdcs:d6aec668a231fd5509b20f6f99cc5d26_filtered_13.bgen name: filtered_13.bgen description: >- An Oxford Bgen file for Chr13. To be used with alspacdcs:86398f756a748b40e51d0b02ad86ce5b_swapped.sample file. See https://doi.org/10.1101/308296 for file format details. (bgen v1.2) md5sum: d6aec668a231fd5509b20f6f99cc5d26 filesize: 4.0G filetype: .bgen number_of_participants: 17450 number_of_variants: 987740 - id: alspacdcs:gi_1000g_g0m_g1_2015-10-30_chr14_f3 name: Chr14 description: Data for Chr14 data_distributions: - id: alspacdcs:33ee444ac5cccbc4d5a938f20cfc9506_filtered_14.bgen name: filtered_14.bgen description: >- An Oxford Bgen file for Chr14. 
To be used with alspacdcs:86398f756a748b40e51d0b02ad86ce5b_swapped.sample file. See https://doi.org/10.1101/308296 for file format details. (bgen v1.2) md5sum: 33ee444ac5cccbc4d5a938f20cfc9506 filesize: 3.9G filetype: .bgen number_of_participants: 17450 number_of_variants: 904351 - id: alspacdcs:gi_1000g_g0m_g1_2015-10-30_chr15_f3 name: Chr15 description: Data for Chr15 data_distributions: - id: alspacdcs:35a01cfb74f7006fc267a915c5f96531_filtered_15.bgen name: filtered_15.bgen description: >- An Oxford Bgen file for Chr15. To be used with alspacdcs:86398f756a748b40e51d0b02ad86ce5b_swapped.sample file. See https://doi.org/10.1101/308296 for file format details. (bgen v1.2) md5sum: 35a01cfb74f7006fc267a915c5f96531 filesize: 3.7G filetype: .bgen number_of_participants: 17450 number_of_variants: 812545 - id: alspacdcs:gi_1000g_g0m_g1_2015-10-30_chr16_f3 name: Chr16 description: Data for Chr16 data_distributions: - id: alspacdcs:a2be7316bcf32fd554f293650d99b265_filtered_16.bgen name: filtered_16.bgen description: >- An Oxford Bgen file for Chr16. To be used with alspacdcs:86398f756a748b40e51d0b02ad86ce5b_swapped.sample file. See https://doi.org/10.1101/308296 for file format details. (bgen v1.2) md5sum: a2be7316bcf32fd554f293650d99b265 filesize: 4.3G filetype: .bgen number_of_participants: 17450 number_of_variants: 865998 - id: alspacdcs:gi_1000g_g0m_g1_2015-10-30_chr17_f3 name: Chr17 description: Data for Chr17 data_distributions: - id: alspacdcs:97f06fcb1f5857e9510d2ba30eee6c4c_filtered_17.bgen name: filtered_17.bgen description: >- An Oxford Bgen file for Chr17. To be used with alspacdcs:86398f756a748b40e51d0b02ad86ce5b_swapped.sample file. See https://doi.org/10.1101/308296 for file format details. 
(bgen v1.2) md5sum: 97f06fcb1f5857e9510d2ba30eee6c4c filesize: 3.8G filetype: .bgen number_of_participants: 17450 number_of_variants: 753174 - id: alspacdcs:gi_1000g_g0m_g1_2015-10-30_chr18_f3 name: Chr18 description: Data for Chr18 data_distributions: - id: alspacdcs:88606600d2352a1127acf21a440273e2_filtered_18.bgen name: filtered_18.bgen description: >- An Oxford Bgen file for Chr18. To be used with alspacdcs:86398f756a748b40e51d0b02ad86ce5b_swapped.sample file. See https://doi.org/10.1101/308296 for file format details. (bgen v1.2) md5sum: 88606600d2352a1127acf21a440273e2 filesize: 3.5G filetype: .bgen number_of_participants: 17450 number_of_variants: 783010 - id: alspacdcs:gi_1000g_g0m_g1_2015-10-30_chr19_f3 name: Chr19 description: Data for Chr19 data_distributions: - id: alspacdcs:b2d78224a6ab150996caca3e4d3ef1df_filtered_19.bgen name: filtered_19.bgen description: >- An Oxford Bgen file for Chr19. To be used with alspacdcs:86398f756a748b40e51d0b02ad86ce5b_swapped.sample file. See https://doi.org/10.1101/308296 for file format details. (bgen v1.2) md5sum: b2d78224a6ab150996caca3e4d3ef1df filesize: 4.0G filetype: .bgen number_of_participants: 17450 number_of_variants: 603516 - id: alspacdcs:gi_1000g_g0m_g1_2015-10-30_chr20_f3 name: Chr20 description: Data for Chr20 data_distributions: - id: alspacdcs:657274f33d9d44a243c59feae7ec561e_filtered_20.bgen name: filtered_20.bgen description: >- An Oxford Bgen file for Chr20. To be used with alspacdcs:86398f756a748b40e51d0b02ad86ce5b_swapped.sample file. See https://doi.org/10.1101/308296 for file format details. (bgen v1.2) md5sum: 657274f33d9d44a243c59feae7ec561e filesize: 2.8G filetype: .bgen number_of_participants: 17450 number_of_variants: 617694 - id: alspacdcs:gi_1000g_g0m_g1_2015-10-30_chr21_f3 name: Chr21 description: Data for Chr21 data_distributions: - id: alspacdcs:1d85b37ade01bf9921be5a10950e28c2_filtered_21.bgen name: filtered_21.bgen description: >- An Oxford Bgen file for Chr21. 
To be used with alspacdcs:86398f756a748b40e51d0b02ad86ce5b_swapped.sample file. See https://doi.org/10.1101/308296 for file format details. (bgen v1.2) md5sum: 1d85b37ade01bf9921be5a10950e28c2 filesize: 1.9G filetype: .bgen number_of_participants: 17450 number_of_variants: 377554 - id: alspacdcs:gi_1000g_g0m_g1_2015-10-30_chr22_f3 name: Chr22 description: Data for Chr22 data_distributions: - id: alspacdcs:a25f95d0477de8dc16234a93a9a4108c_filtered_22.bgen name: filtered_22.bgen description: >- An Oxford Bgen file for Chr22. To be used with alspacdcs:86398f756a748b40e51d0b02ad86ce5b_swapped.sample file. See https://doi.org/10.1101/308296 for file format details. (bgen v1.2) md5sum: a25f95d0477de8dc16234a93a9a4108c filesize: 2.1G filetype: .bgen number_of_participants: 17450 number_of_variants: 365644 - id: alspacdcs:gi_1000g_g0m_g1_2015-10-30_chr23_f3 name: Chr23 description: Data for Chr23 data_distributions: - id: alspacdcs:9fdb2874bc5f30f22c71be64037ebc70_filtered_23.bgen name: filtered_23.bgen description: >- An Oxford Bgen file for Chr23. To be used with alspacdcs:86398f756a748b40e51d0b02ad86ce5b_swapped.sample file. See https://doi.org/10.1101/308296 for file format details. (bgen v1.2) md5sum: 9fdb2874bc5f30f22c71be64037ebc70 filesize: 5.9G filetype: .bgen number_of_participants: 17450 number_of_variants: 1250218
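Each data distribution in the freeze metadata carries an `md5sum`, so a received copy of a file can be checked against it. The following is a minimal sketch of such a check using only the Python standard library. The function names and the idea of a single local directory holding the files are assumptions made for illustration; the two checksums are copied from the metadata above.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Expected checksums copied from the freeze metadata (two files shown).
EXPECTED_MD5 = {
    "swapped.sample": "86398f756a748b40e51d0b02ad86ce5b",
    "filtered_22.bgen": "a25f95d0477de8dc16234a93a9a4108c",
}


def check_freeze(directory):
    """Map each expected file name to True if present with a matching MD5."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5_of_file(path) == expected
    return results
```

Reading in chunks keeps memory use flat, which matters for files of several gigabytes such as the per-chromosome bgen files. Calling `check_freeze` on the directory holding the files returns `False` for any file that is missing or does not match its recorded checksum.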
5. Sequence Data
5.1. Whole genome sequencing - G1 (wgs_hiseq_g1)
5.1.1. Description
This dataset contains whole genome sequencing for G1 individuals, part of the UK10K dataset.
5.1.2. Methodology
ALSPAC and TwinsUK cohorts were sequenced at an average read depth of 6.7x through the UK10K program (http://www.uk10k.org) using the Illumina HiSeq platform, and aligned to the GRCh37 human reference using BWA. SNV calls were made using samtools/bcftools, and VQSR and GATK were then used to recall these calls.
Associated publication: http://www.ncbi.nlm.nih.gov/pubmed/26367797
Please ensure you have permission to access this data (http://www.uk10k.org/data_access.html) before using it.
5.1.3. Freeze Docs
# This yaml file is a description of a freeze of a released version of a named alspac dataset # It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema id: alspacdcs:wgs_hiseq_g1_2016-08-18_f3 name: Whole genome sequencing - G1 version 2016-08-18 freeze 3 description: >- This is freeze 3 of version 2016-08-18 of the whole genome sequencing dataset for G1 individuals, part of the UK10K dataset. freeze_size: 350G linker_file_md5sum: b528acad88cd1697129a7cd59aa14ada woc_file_md5sum: cf9249c306e766a8689f78197e1f5f25 all_individuals_to_exclude_md5sum: 7faad74aeebaba4ed71aac783414d75b git_tag: https://github.com/alspac/dataset_wgs_hiseq_g1/releases/tag/freeze3 is_current_freeze: true freeze_number: 3 freeze_date: 2023-09-13 previous_freeze: alspacdcs:wgs_hiseq_g1_2016-08-18_f2 freeze_of_alspac_dataset_version: alspacdcs:wgs_hiseq_g1_2016-08-18 freeze_of_named_alspac_dataset: alspacdcs:wgs_hiseq_g1 has_containers: - id: alspacdcs:90e90672-c949-4ac1-bd68-62a40b9f6571 ## uuid name: data description: A dir/folder containing the freeze data files has_parts: - id: alspacdcs:5633a76c-fdc5-4cb2-9ff4-e42df8619662 name: 10_freeze data_distributions: - id: alspacdcs:146a37b8-6eec-41e3-b369-0044a20e429b name: 10_freeze.vcf.gz.csi md5sum: 91511bf844e95e2c3589c5e4b8d29dc0 filesize: 97K filetype: .csi belongs_to_container: alspacdcs:90e90672-c949-4ac1-bd68-62a40b9f6571 - id: alspacdcs:48be8dec-e137-474d-808e-535563799202 name: 10_freeze.vcf.gz md5sum: 0fbc3391092f6528bed700bb9678b160 filesize: 17G filetype: vcf.bgz belongs_to_container: alspacdcs:90e90672-c949-4ac1-bd68-62a40b9f6571 - id: alspacdcs:8b27a150-8809-4471-9374-1977c84496f6 name: 11_freeze data_distributions: - id: alspacdcs:4124a844-e108-42ed-8dfb-b0cebd4f1911 name: 11_freeze.vcf.gz.csi md5sum: 0ab4531bb18ae0e0ff4d2b60a91e03bc filesize: 97K filetype: .csi belongs_to_container: alspacdcs:90e90672-c949-4ac1-bd68-62a40b9f6571 - id: alspacdcs:3b35d468-471e-4f13-beef-16e7877f0a8b name: 11_freeze.vcf.gz 
md5sum: 9b0fca0596d382b861670db9d1f39a5d filesize: 17G filetype: vcf.bgz belongs_to_container: alspacdcs:90e90672-c949-4ac1-bd68-62a40b9f6571 - id: alspacdcs:3ae0b8ea-47f9-4a99-a8df-67a2b62d8688 name: 12_freeze data_distributions: - id: alspacdcs:58b10348-5dba-43f5-87c9-0d7dcb7de96c name: 12_freeze.vcf.gz.csi md5sum: 6cb3ea848c9d2148666ab9d10bf18116 filesize: 98K filetype: .csi belongs_to_container: alspacdcs:90e90672-c949-4ac1-bd68-62a40b9f6571 - id: alspacdcs:60ccbba8-6747-46c6-9f87-31889c028199 name: 12_freeze.vcf.gz md5sum: 51c4767c52e3771f6a1cf76f92686389 filesize: 17G filetype: vcf.bgz belongs_to_container: alspacdcs:90e90672-c949-4ac1-bd68-62a40b9f6571 - id: alspacdcs:ad543513-2135-4780-9af2-502401a46420 name: 13_freeze data_distributions: - id: alspacdcs:0f0c59e8-a860-4684-a5fa-8fed592af5f0 name: 13_freeze.vcf.gz.csi md5sum: 0d9c2f70487241a97ffe5076c40bedb6 filesize: 71K filetype: .csi belongs_to_container: alspacdcs:90e90672-c949-4ac1-bd68-62a40b9f6571 - id: alspacdcs:b14dd88d-d6c3-4ebc-bd08-5219c9531292 name: 13_freeze.vcf.gz md5sum: 89fbead51842c68140f9232264d0ee8d filesize: 13G filetype: vcf.bgz belongs_to_container: alspacdcs:90e90672-c949-4ac1-bd68-62a40b9f6571 - id: alspacdcs:1fcb13e9-5275-4c42-a402-6cdafdbecaaa name: 14_freeze data_distributions: - id: alspacdcs:d580c4bf-e41b-4e71-a24e-a5d2f94e8f82 name: 14_freeze.vcf.gz.csi md5sum: a8073057cb520c8f581b359ad2e2838a filesize: 65K filetype: .csi belongs_to_container: alspacdcs:90e90672-c949-4ac1-bd68-62a40b9f6571 - id: alspacdcs:4c5daf36-da19-4ef9-b2bf-58130d1ef311 name: 14_freeze.vcf.gz md5sum: a8b0753bc6e7abc4b237eb01311a601f filesize: 12G filetype: vcf.bgz belongs_to_container: alspacdcs:90e90672-c949-4ac1-bd68-62a40b9f6571 - id: alspacdcs:9ef54eaa-edea-46f1-b528-a8e3f29d15a0 name: 15_freeze data_distributions: - id: alspacdcs:a08d92a9-f555-46ab-afb9-92bc8dcc4d6c name: 15_freeze.vcf.gz.csi md5sum: add4c019bdb6588fa7a215ee326984b5 filesize: 59K filetype: .csi belongs_to_container: 
alspacdcs:90e90672-c949-4ac1-bd68-62a40b9f6571 - id: alspacdcs:de91ccf4-39b2-4084-8c6f-79912b8a926b name: 15_freeze.vcf.gz md5sum: a25568744021b27b89e5e402d86e7e74 filesize: 10G filetype: vcf.bgz belongs_to_container: alspacdcs:90e90672-c949-4ac1-bd68-62a40b9f6571 - id: alspacdcs:6208f513-618a-4cbc-8cdb-d5ae602ebf8c name: 16_freeze data_distributions: - id: alspacdcs:e88c099e-7c0b-4473-93bd-51f3083f1b90 name: 16_freeze.vcf.gz.csi md5sum: ee4ce9b3479a9c7ce482b59f0b2bd93d filesize: 58K filetype: .csi belongs_to_container: alspacdcs:90e90672-c949-4ac1-bd68-62a40b9f6571 - id: alspacdcs:a456feb6-7e1a-498a-8d30-39065f4af552 name: 16_freeze.vcf.gz md5sum: 31b6a197d0d939743199bf622e2abdab filesize: 11G filetype: vcf.bgz belongs_to_container: alspacdcs:90e90672-c949-4ac1-bd68-62a40b9f6571 - id: alspacdcs:4972b5af-7c9f-4004-8c7a-ffa6c490791b name: 17_freeze data_distributions: - id: alspacdcs:a4d3b46b-7ceb-4d59-ad1e-f1856712470f name: 17_freeze.vcf.gz.csi md5sum: 7120caa24202ee64a5a4b99e91bbac4d filesize: 57K filetype: .csi belongs_to_container: alspacdcs:90e90672-c949-4ac1-bd68-62a40b9f6571 - id: alspacdcs:dca2737f-b31b-4768-ac29-5a0a64b6aa05 name: 17_freeze.vcf.gz md5sum: f9550f900a3304209eeb3cfd74ffb973 filesize: 9.4G filetype: vcf.bgz belongs_to_container: alspacdcs:90e90672-c949-4ac1-bd68-62a40b9f6571 - id: alspacdcs:30815635-3a0d-4bbf-bceb-7128fe47c46b name: 18_freeze data_distributions: - id: alspacdcs:bb32b4e7-82d9-4f95-8d72-2c61f9ef1961 name: 18_freeze.vcf.gz.csi md5sum: b8682ac0f332494d59c44601221b42f3 filesize: 56K filetype: .csi belongs_to_container: alspacdcs:90e90672-c949-4ac1-bd68-62a40b9f6571 - id: alspacdcs:d47c688b-6819-4469-b3bf-a7dc591abbd9 name: 18_freeze.vcf.gz md5sum: 56120050bd79d79c4cbdf2af9f47cc52 filesize: 9.7G filetype: vcf.bgz belongs_to_container: alspacdcs:90e90672-c949-4ac1-bd68-62a40b9f6571 - id: alspacdcs:8a371ba2-bbba-484f-9cee-9bdce9d4253e name: 19_freeze data_distributions: - id: alspacdcs:93ccee60-f677-4d67-ba9a-30a47638126b name: 
19_freeze.vcf.gz.csi md5sum: fd710d3410d43fd0c711eb2685f31d65 filesize: 41K filetype: .csi belongs_to_container: alspacdcs:90e90672-c949-4ac1-bd68-62a40b9f6571 - id: alspacdcs:366d7fd4-00fd-45a2-b2e5-f7261081b896 name: 19_freeze.vcf.gz md5sum: 286780a3cd433f41669ebcbbb2797592 filesize: 7.2G filetype: vcf.bgz belongs_to_container: alspacdcs:90e90672-c949-4ac1-bd68-62a40b9f6571 - id: alspacdcs:27b7b3c6-0bfd-4dbc-b181-a3c5fe4847df name: 20_freeze data_distributions: - id: alspacdcs:91e493ea-8e4f-4e8e-a22e-8f88c6b6766e name: 20_freeze.vcf.gz.csi md5sum: 2e5117bb3c3e50fb1ad169112975d6c4 filesize: 44K filetype: .csi belongs_to_container: alspacdcs:90e90672-c949-4ac1-bd68-62a40b9f6571 - id: alspacdcs:871922c6-eef7-4fdd-bcf6-d81c1c984a2e name: 20_freeze.vcf.gz md5sum: 60db6e4041f7d5ec491d5a86ffc92756 filesize: 7.6G filetype: vcf.bgz belongs_to_container: alspacdcs:90e90672-c949-4ac1-bd68-62a40b9f6571 - id: alspacdcs:b2e07654-8076-4fc4-8027-4afa99e34f40 name: 21_freeze data_distributions: - id: alspacdcs:518c911f-b753-4a69-a920-4f04c49a5587 name: 21_freeze.vcf.gz.csi md5sum: 918199bf16c55ee8b099e90c78f7722a filesize: 25K filetype: .csi belongs_to_container: alspacdcs:90e90672-c949-4ac1-bd68-62a40b9f6571 - id: alspacdcs:b699bbdb-d686-4796-8d25-da87a06cd122 name: 21_freeze.vcf.gz md5sum: cbdc96d94b2a4bdd2265ac364020bd9e filesize: 4.5G filetype: vcf.bgz belongs_to_container: alspacdcs:90e90672-c949-4ac1-bd68-62a40b9f6571 - id: alspacdcs:f2d80d1e-af15-42d0-a8a0-aa18ecf06b2c name: 22_freeze data_distributions: - id: alspacdcs:6a0ae694-8985-419e-a4fb-7a84cd244628 name: 22_freeze.vcf.gz.csi md5sum: 88e896a168ee7773a3bb179985d6dfd9 filesize: 25K filetype: .csi belongs_to_container: alspacdcs:90e90672-c949-4ac1-bd68-62a40b9f6571 - id: alspacdcs:f0a555db-93f8-470b-9c76-c8f2c7d2bdc5 name: 22_freeze.vcf.gz md5sum: 3dee0194f9f3279d7a9d63df36373c56 filesize: 4.6G filetype: vcf.bgz belongs_to_container: alspacdcs:90e90672-c949-4ac1-bd68-62a40b9f6571 - id: 
alspacdcs:df7a4f81-5f57-4d95-b07a-fe2c81147827 name: 1_freeze data_distributions: - id: alspacdcs:a9048d61-ccb4-4570-be15-66e1f7f5edd3 name: 1_freeze.vcf.gz.csi md5sum: 8015cf7a7c445913ab20e26c82e391d6 filesize: 165K filetype: .csi belongs_to_container: alspacdcs:90e90672-c949-4ac1-bd68-62a40b9f6571 - id: alspacdcs:0f331c08-9209-4523-bdc2-ba664fbdf290 name: 1_freeze.vcf.gz md5sum: c57904a609267af336527b89c6b7d352 filesize: 28G filetype: vcf.bgz belongs_to_container: alspacdcs:90e90672-c949-4ac1-bd68-62a40b9f6571 - id: alspacdcs:d469118b-f62c-41ef-a686-2de4db94b534 name: 2_freeze data_distributions: - id: alspacdcs:6f05c026-1ede-4518-b911-89e3ae372eaf name: 2_freeze.vcf.gz.csi md5sum: 7614c0827ffd3d6b9330b06e17c63b70 filesize: 177K filetype: .csi belongs_to_container: alspacdcs:90e90672-c949-4ac1-bd68-62a40b9f6571 - id: alspacdcs:a3a71754-a81a-4b31-8c81-8581ca14d256 name: 2_freeze.vcf.gz md5sum: 2931880238de0df38880a1a23cc2572c filesize: 30G filetype: vcf.bgz belongs_to_container: alspacdcs:90e90672-c949-4ac1-bd68-62a40b9f6571 - id: alspacdcs:3364d4fb-3423-4e1d-8704-6006e15be96c name: 3_freeze data_distributions: - id: alspacdcs:dcae2bd9-658a-4d04-b5bb-fc0c74f9f463 name: 3_freeze.vcf.gz.csi md5sum: ba133388d8dd77faa69f2516f54371b1 filesize: 146K filetype: .csi belongs_to_container: alspacdcs:90e90672-c949-4ac1-bd68-62a40b9f6571 - id: alspacdcs:c1a8bda7-6453-49f8-852f-4f1445d87940 name: 3_freeze.vcf.gz md5sum: e30f7e0ab96014e4e3c47005f61537d2 filesize: 25G filetype: vcf.bgz belongs_to_container: alspacdcs:90e90672-c949-4ac1-bd68-62a40b9f6571 - id: alspacdcs:8a7521a6-30bf-400e-9608-f16c98e1f174 name: 4_freeze data_distributions: - id: alspacdcs:aaa62301-361f-4779-83e8-8ab6a135a61b name: 4_freeze.vcf.gz.csi md5sum: 28e7addb16dd0405872b09489880b887 filesize: 139K filetype: .csi belongs_to_container: alspacdcs:90e90672-c949-4ac1-bd68-62a40b9f6571 - id: alspacdcs:526084e4-2c2d-4723-ba10-b6d26364fd99 name: 4_freeze.vcf.gz md5sum: ecb5afbf8da9d64d016ba768807a8744 filesize: 24G filetype: 
vcf.bgz belongs_to_container: alspacdcs:90e90672-c949-4ac1-bd68-62a40b9f6571 - id: alspacdcs:81b2e40b-f29e-4390-ac7a-dd90f38e69c5 name: 5_freeze data_distributions: - id: alspacdcs:6ca7b22f-f416-439f-ade3-5ec00672c6f6 name: 5_freeze.vcf.gz.csi md5sum: bd20d10d9cd6ba10d4a60ffb80e91466 filesize: 132K filetype: .csi belongs_to_container: alspacdcs:90e90672-c949-4ac1-bd68-62a40b9f6571 - id: alspacdcs:f67ce6d2-ee1c-4f2e-9841-0f76374f4969 name: 5_freeze.vcf.gz md5sum: df1fb13feb38cdca9e0f8f1e97675be9 filesize: 23G filetype: vcf.bgz belongs_to_container: alspacdcs:90e90672-c949-4ac1-bd68-62a40b9f6571 - id: alspacdcs:8bceb3b2-0bed-4844-b3a8-49b9ab225a21 name: 6_freeze data_distributions: - id: alspacdcs:c29c3da4-6401-4d0a-b2bc-9a8466071b79 name: 6_freeze.vcf.gz.csi md5sum: de346032233062928343da7b881b451e filesize: 125K filetype: .csi belongs_to_container: alspacdcs:90e90672-c949-4ac1-bd68-62a40b9f6571 - id: alspacdcs:8f4a5481-c05f-4bb8-9c88-50da342a517e name: 6_freeze.vcf.gz md5sum: bba432e85f3fddfa10a6629e55b84ca9 filesize: 22G filetype: vcf.bgz belongs_to_container: alspacdcs:90e90672-c949-4ac1-bd68-62a40b9f6571 - id: alspacdcs:2dd0e39f-ea09-4148-bc79-7b6d0f53e18f name: 7_freeze data_distributions: - id: alspacdcs:86afafd8-47db-4bd6-bf56-44349a0048cf name: 7_freeze.vcf.gz.csi md5sum: 212200545dbbbdadcb9d46250b29909b filesize: 116K filetype: .csi belongs_to_container: alspacdcs:90e90672-c949-4ac1-bd68-62a40b9f6571 - id: alspacdcs:a282b5fe-4737-46e7-918e-2c1cb556d7e7 name: 7_freeze.vcf.gz md5sum: 8290747625ec7d1a049771cac46cf508 filesize: 20G filetype: vcf.bgz belongs_to_container: alspacdcs:90e90672-c949-4ac1-bd68-62a40b9f6571 - id: alspacdcs:d109d0e3-17a3-4a16-ae0b-820fce0ac7c8 name: 8_freeze data_distributions: - id: alspacdcs:e4908d8d-5be0-4f8c-9b6c-eaf7699803a1 name: 8_freeze.vcf.gz.csi md5sum: e89d4b8718cf80caf269ad710439e420 filesize: 106K filetype: .csi belongs_to_container: alspacdcs:90e90672-c949-4ac1-bd68-62a40b9f6571 - id: 
alspacdcs:9df5ee5c-6804-4fce-a5cc-22acdaed6583 name: 8_freeze.vcf.gz md5sum: dae9706dcaa2ef486df7273a4625cd07 filesize: 20G filetype: vcf.bgz belongs_to_container: alspacdcs:90e90672-c949-4ac1-bd68-62a40b9f6571 - id: alspacdcs:3ccc6e8a-4199-4691-be97-6e1772badab0 name: 9_freeze data_distributions: - id: alspacdcs:0829f3ad-fdfd-4dfc-a955-deb0cb015de1 name: 9_freeze.vcf.gz.csi md5sum: 06983116ea48e1f1476d1284d67a708e filesize: 86K filetype: .csi belongs_to_container: alspacdcs:90e90672-c949-4ac1-bd68-62a40b9f6571 - id: alspacdcs:bfbd9db9-1df0-491d-85fe-0d491464f8fe name: 9_freeze.vcf.gz md5sum: 67f7dfeff3eeb33b9201c504492b3036 filesize: 15G filetype: vcf.bgz belongs_to_container: alspacdcs:90e90672-c949-4ac1-bd68-62a40b9f6571 - id: alspacdcs:46c814b7-fa5a-4dae-a039-b3f0a8cfc81d name: X_freeze data_distributions: - id: alspacdcs:49783fc3-d727-45ff-9f82-37b7fe28e9c0 name: X_freeze.vcf.gz.csi md5sum: 206d0f2365bdc9b128c8dad17b38039a filesize: 110K filetype: .csi belongs_to_container: alspacdcs:90e90672-c949-4ac1-bd68-62a40b9f6571 - id: alspacdcs:be2bfe88-5a33-4046-9f36-db90b1fa108c name: X_freeze.vcf.gz md5sum: cfd01b886762f1a53e6b928a0718f005 filesize: 11G filetype: vcf.bgz belongs_to_container: alspacdcs:90e90672-c949-4ac1-bd68-62a40b9f6571
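Every chromosome in this freeze is listed as a compressed VCF together with a `.csi` index, so a quick completeness check on a received copy is to confirm that both files are present for each chromosome. The sketch below assumes the files sit together in one directory and are named `<chromosome>_freeze.vcf.gz`, following the naming pattern of the entries listed above; the function names are illustrative.

```python
from pathlib import Path

# Chromosomes listed in the freeze metadata: autosomes 1-22 plus X.
CHROMOSOMES = [str(number) for number in range(1, 23)] + ["X"]


def expected_vcf_names():
    """File names of the per-chromosome VCFs (assumed naming pattern)."""
    return [f"{chrom}_freeze.vcf.gz" for chrom in CHROMOSOMES]


def missing_files(data_dir):
    """Return the names of VCF or index files not found in data_dir."""
    data_dir = Path(data_dir)
    missing = []
    for vcf_name in expected_vcf_names():
        # Each VCF should be accompanied by its CSI index.
        for name in (vcf_name, vcf_name + ".csi"):
            if not (data_dir / name).is_file():
                missing.append(name)
    return missing
```

An empty list from `missing_files` means all 23 VCFs and their indexes were found. It does not confirm the files are intact, which is what the `md5sum` values in the metadata are for.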
6. Epigenetic Data
6.1. DNA methylation - 450k - G0 mothers + G1 (dnam_450_g0m_g1)
6.1.1. Description
This dataset contains Illumina Infinium HumanMethylation450K BeadChip array data on G1 mothers at two timepoints (pregnancy and middle age), G1 participants at 5 timepoints and G0 participants at three timepoints (birth, childhood and adolescence).
This dataset was generated as part of the Accessible Resource for Integrated Epigenomics Studies (http://www.ariesepigenomics.org.uk/). This dataset is superseded by dnam_epic450_g0_g1.
6.1.2. Methodology
Associated publication: https://doi.org/10.1093/ije/dyv072
Associated R package: https://github.com/MRCIEU/aries
6.1.3. Freeze Docs
# This yaml file is a description of a freeze of a released version of a named alspac dataset # It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema id: alspacdcs:dnam_450_g0m_g1_2016-05-03_f3 name: >- DNA methylation - 450k - G0 mothers + G1 version 2016-05-03 Freeze 3 description: >- This is the third freeze of the 2016-05-03 version of dnam_450_g0m_g1 dataset. freeze_size: 18G linker_file_md5sum: b528acad88cd1697129a7cd59aa14ada woc_file_md5sum: cf9249c306e766a8689f78197e1f5f25 all_individuals_to_exclude_md5sum: 7faad74aeebaba4ed71aac783414d75b git_tag: https://github.com/alspac/dataset_dnam_450_g0m_g1/releases/tag/Freeze3 is_current_freeze: true freeze_number: 3 freeze_date: 2023-09-13 previous_freeze: alspacdcs:dnam_450_g0m_g1_2016-05-03_f2 freeze_of_alspac_dataset_version: alspacdcs:dnam_450_g0m_g1_2016-05-03 freeze_of_named_alspac_dataset: alspacdcs:dnam_450_g0m_g1 has_containers: - id: alspacdcs:ea27b439-5647-4656-b3dd-568437a9d972 name: data description: A dir/folder containing the data files - id: alspacdcs:71babe5d-8096-4b81-badd-f092a285d9da name: betas description: A dir/folder containing the beta files belongs_to_container: alspacdcs:ea27b439-5647-4656-b3dd-568437a9d972 - id: alspacdcs:fd31262e-5dcb-48a3-a5b0-5b295110094b name: control_matrix description: A dir/folder containing the control matrix files belongs_to_container: alspacdcs:ea27b439-5647-4656-b3dd-568437a9d972 - id: alspacdcs:0928ae7e-6d94-47ce-9890-a8350bcd46aa name: derived description: A dir/folder containing the derived data (e.g. 
Cell count predictions) belongs_to_container: alspacdcs:ea27b439-5647-4656-b3dd-568437a9d972 - id: alspacdcs:e877f56c-6174-4427-a8ba-333b5632d85a name: cellcounts description: A dir/folder containing the cell count predictions belongs_to_container: alspacdcs:0928ae7e-6d94-47ce-9890-a8350bcd46aa - id: alspacdcs:8b59a158-94d0-4244-a779-f4695ceb3d9a name: cord description: >- A dir/folder containing the cell count predictions for cord. belongs_to_container: alspacdcs:e877f56c-6174-4427-a8ba-333b5632d85a - id: alspacdcs:3279aec3-c6c3-4d04-809e-94eadc51c0c8 name: andrews-and-bakulski description: >- A dir/folder containing the cell count predictions by andrews-and-bakulski algorithm belongs_to_container: alspacdcs:8b59a158-94d0-4244-a779-f4695ceb3d9a - id: alspacdcs:8abae404-42a9-452a-9a26-7f6c8eed5c6b name: gervinandlyle description: >- A dir/folder containing the cell count predictions by gervinandlyle algorithm/method. belongs_to_container: alspacdcs:8b59a158-94d0-4244-a779-f4695ceb3d9a - id: alspacdcs:021ad3f5-6e32-42c0-91c6-f996a9b6e62b name: gse68456 description: >- A dir/folder containing the cell count predictions by the gse68456 method. belongs_to_container: alspacdcs:8b59a158-94d0-4244-a779-f4695ceb3d9a - id: alspacdcs:ea167030-d783-46c5-b8d5-3cbd9431f396 name: houseman description: >- A dir/folder containing the cell count predictions by houseman method. belongs_to_container: alspacdcs:e877f56c-6174-4427-a8ba-333b5632d85a - id: alspacdcs:6e79ad66-78a5-4102-a071-7c259151d0af name: detection_p_values description: A dir/folder containing the matrix of detection values belongs_to_container: alspacdcs:ea27b439-5647-4656-b3dd-568437a9d972 - id: alspacdcs:4e32e07e-181d-46d2-b134-71ee5f6bd53e name: qc.objects_all description: >- A dir/folder containing the samples extracted from lims and not cleaned. 
belongs_to_container: alspacdcs:ea27b439-5647-4656-b3dd-568437a9d972 - id: alspacdcs:5a4d2e29-aa60-493e-a33c-7bcb63be8088 name: qc.objects_clean description: A dir/folder containing the cleaned samples from Lims belongs_to_container: alspacdcs:ea27b439-5647-4656-b3dd-568437a9d972 - id: alspacdcs:3275bc26-4695-43b8-915e-4bbc4d13018f name: samplesheet description: A dir/folder containing the manifest file from Lims. belongs_to_container: alspacdcs:ea27b439-5647-4656-b3dd-568437a9d972 has_parts: - id: alspacdcs:5e9f67ac-4ddd-4535-9991-ea99fa112d45 name: betas description: >- Normalized betas using functional normalization. We used 10 PCs on the controlmatrix to regress out technical variation. Slide was regressed out as random effect before normaliziation. CpGs are in rows and samples in columns. data_distributions: - id: alspacdcs:8cace59a-0aef-4977-95eb-d6ef0bccc8b6 name: data.Robj description: >- R data object for the Normalized beta data. md5sum: f28327f68c4286c3e0ae721020f55f49 filesize: 17G filetype: .Robj belongs_to_container: alspacdcs:71babe5d-8096-4b81-badd-f092a285d9da number_of_participants: - id: alspacdcs:1100e73c-c40d-41d8-943b-978a155fbc5e name: control matrix description: >- The 850 control probes are summarized in 42 control types. These probes can roughly be divided into negative control probes (613), probes intended for between array normalization (186) and the remainder (49), which are designed for quality control, including assessing the bisulfite conversion rate. None of these probes are designed to measure a biological signal. The summarized control probes can be used as surrogates for unwanted variation and are used for the functional normalization. Samples are rows and 42 control types are in columns. data_distributions: - id: alspacdcs:eaef665f-213a-40e7-9c4b-65dfc1955623 name: data.txt description: >- Plain text file of the control matrix. 
md5sum: 443369530a5d75fdac8c1cac2fe45e15 filesize: 1.8M filetype: .txt belongs_to_container: alspacdcs:fd31262e-5dcb-48a3-a5b0-5b295110094b number_of_participants: - id: alspacdcs:e87e270b-b4e9-45c8-85a9-80489d1a99d3 name: andrews and bakulksi cord cell counts description: >- Cellcounts in cord predicted using cord reference published in Bakulski et al 2016 (PMID: 27019159). This reference has been implemented in meffil. In this text file, samples are in rows and cell types in columns. data_distributions: - id: alspacdcs:c2b9acf3-09f5-4d38-8d66-26a37c8f804c name: data.txt description: >- Plain text file of cellcounts in cord predicted using Bakulski. md5sum: 79b04868cc502a1a34ade01958f22790 filesize: 118k filetype: .txt belongs_to_container: alspacdcs:3279aec3-c6c3-4d04-809e-94eadc51c0c8 number_of_participants: 912 - id: alspacdcs:d059b0ff-65c1-4e39-8d4a-b129b5898811 name: geervin and lyle cord cell counts description: >- Cellcounts in cord predicted using GervinandLyle cord reference (unpublised). This reference has been implemented in meffil. Samples are in rows and cell types in columns. data_distributions: - id: alspacdcs:877af9e8-ecfb-46d6-8767-a76ee4c68b2c name: data.txt description: >- Plain text file of cell counts predicted using GervinandLyle cord reference. md5sum: 0d8535330ac6e12e7f3c5a5f3f30e600 filesize: 100k filetype: .txt belongs_to_container: alspacdcs:8abae404-42a9-452a-9a26-7f6c8eed5c6b number_of_participants: 912 - id: alspacdcs:8cc1141b-13da-483a-aa47-2b3ca5b7b1c1 name: gse68456 cord cell counts description: >- Cellcounts in cord predicted using cord reference published in de Goede et al (PMID: 26366232). This reference has been implemented in meffil. Samples are in rows and cell types in columns. data_distributions: - id: alspacdcs:4df00b08-2234-4231-a408-c17f64f8e75d name: data.txt description: >- Plain text file containinng cell counts predicted using cord reference. 
md5sum: 837e1e40bf27d8f6bd1a402f016b798e filesize: 120k filetype: .txt belongs_to_container: alspacdcs:021ad3f5-6e32-42c0-91c6-f996a9b6e62b number_of_participants: 912 - id: alspacdcs:1153615c-a3d4-4bdf-a294-293994144626 name: houseman cell counts description: >- Cell counts extracted using Houseman algorithm implemented in meffil (PMID: 22568884). Samples are in rows and cell types in columns. data_distributions: - id: alspacdcs:4b430991-4329-415c-8781-9f12e7944359 name: data.txt description: >- Text file of the cell counts calculated using Houseman algorithm. md5sum: 2792f7708e710536c069b05c0192c57d filesize: 569k filetype: .txt belongs_to_container: alspacdcs:ea167030-d783-46c5-b8d5-3cbd9431f396 number_of_participants: 4843 - id: alspacdcs:0225c24c-a4c6-4c29-a791-71ee7049f899 name: detection p values description: >- This matrix shows the detection pvalues for each sample and each CpG and is extracted from the idat files using the "meffil.load.detection.pvalues" function in meffil. CpGs are in rows and samples in columns. data_distributions: - id: alspacdcs:284e4a48-0ec9-4988-bbc9-55c752e94145 name: data.Robj description: >- R object file for the detection p values matrix md5sum: 5b3445d77c5f212dcd10b1645aca7632 filesize: 418M filetype: .Robj belongs_to_container: alspacdcs:6e79ad66-78a5-4102-a071-7c259151d0af number_of_participants: - id: alspacdcs:183a7d3b-16c9-427c-a2ff-ff4f303bdad6 name: qc objects all description: >- This objects contain samples extracted from LIMS and is not cleaned up. This object has been used to do the data cleaning. All data processing has been conducted using Meffil. Meffil uses illuminaio R package to parse Illumina IDAT files into a meffil object called qc.objects. All meffil functions, QC summary, functional normalization and post-normalization QC summary operate on the qc or norm.objects. 
Specifically, the qc.objects contain raw control probe intensities, poor quality probes based on detection Pvalues and number of beads, predicted sex, predicted cellcounts and a samplesheet with batch variables. In addition, copy number variation can be extracted. This object is a list of individuals. data_distributions: - id: alspacdcs:32c99449-9401-4ccc-8806-1476a535acae name: data.Robj description: >- R data file of the qc objects. md5sum: 4e754e357d16b507650a5c5f56621dd3 filesize: 497M filetype: .Robj belongs_to_container: alspacdcs:4e32e07e-181d-46d2-b134-71ee5f6bd53e number_of_participants: - id: alspacdcs:76972370-cdf4-4887-b4a5-14fe31236813 name: qc objects clean description: >- All data processing has been conducted using Meffil. Meffil uses illuminaio R package to parse Illumina IDAT files into a meffil object called norm.objects. All meffil functions, QC summary, functional normalization and post-normalization QC summary operate on the norm.objects. Specifically, the norm.objects contain raw control probe intensities, quantile distributions of the raw intensities, poor quality probes based on detection Pvalues and number of beads, predicted sex, predicted cellcounts and a samplesheet with batch variables. In addition, copy number variation can be extracted. This object is a list of individuals. data_distributions: - id: alspacdcs:5f2b149e-73dd-44b9-ab15-58d8ffded660 name: data.Robj description: >- R object file of qc objects clean. md5sum: c69ed033e28ea6f822a85c165cf78b83 filesize: 659M filetype: .Robj belongs_to_container: alspacdcs:5a4d2e29-aa60-493e-a33c-7bcb63be8088 number_of_participants: - id: alspacdcs:cfd86d55-286a-42cf-86af-ac72ffce4893 name: samplesheet description: >- Manifest file with columns extracted directly from LIMS and age, sex, aln, timepoint, timecode, sampletype, genotypeQC columns to remove population stratification samples, duplicate.rm column to remove duplicates. Samples in rows, variables in columns. 
data_distributions: - id: alspacdcs:727cb669-bda3-44c7-adac-57f67f53eb41 name: data.Robj description: >- R data object manifest file. md5sum: f47ad58a27ebd89d3fe3c81d25b4dc08 filesize: 100K filetype: .Robj belongs_to_container: alspacdcs:3275bc26-4695-43b8-915e-4bbc4d13018f number_of_participants: 4843
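The `has_containers` entries in the freeze description above form a hierarchy through their `belongs_to_container` links. As a minimal sketch, assuming each container corresponds to a nested dir/folder inside the freeze (the descriptions suggest this, but the on-disk layout is not confirmed here), the following Python walks up the links to build a relative path. The dictionary is hand-copied from four of the entries above, and `container_path` is a hypothetical helper name.

```python
# Four containers copied by hand from the freeze description above:
# id -> (name, id of the parent container, or None for the top level).
CONTAINERS = {
    "alspacdcs:ea27b439-5647-4656-b3dd-568437a9d972": ("data", None),
    "alspacdcs:0928ae7e-6d94-47ce-9890-a8350bcd46aa": (
        "derived", "alspacdcs:ea27b439-5647-4656-b3dd-568437a9d972"),
    "alspacdcs:e877f56c-6174-4427-a8ba-333b5632d85a": (
        "cellcounts", "alspacdcs:0928ae7e-6d94-47ce-9890-a8350bcd46aa"),
    "alspacdcs:ea167030-d783-46c5-b8d5-3cbd9431f396": (
        "houseman", "alspacdcs:e877f56c-6174-4427-a8ba-333b5632d85a"),
}


def container_path(container_id, containers=CONTAINERS):
    """Return the assumed relative folder path for a container id."""
    parts = []
    seen = set()
    current = container_id
    while current is not None:
        if current in seen:
            raise ValueError("cycle in belongs_to_container links")
        seen.add(current)
        name, parent = containers[current]
        parts.append(name)
        current = parent
    # Parents were collected child-first, so reverse before joining.
    return "/".join(reversed(parts))


print(container_path("alspacdcs:ea167030-d783-46c5-b8d5-3cbd9431f396"))
# prints data/derived/cellcounts/houseman
```

The same lookup applies to a data distribution: its `belongs_to_container` value gives the container whose path is resolved.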
6.2. DNA methylation - EPIC & 450k - G0 + G1 (dnam_epic450_g0_g1)
6.2.1. Description
This dataset contains methylation data collected from both G0 and G1 on two arrays at different timepoints. This dataset supersedes dnam_450_g0m_g1.
There is data from Illumina Infinium HumanMethylation450K BeadChip array on G1 mothers at two timepoints (pregnancy and middle age), G1 participants at 5 timepoints and G0 participants at three timepoints (birth, childhood and adolescence). This dataset also contains data from Infinium MethylationEPIC v1.0 data on 2721 G1 individuals at 2 timepoints.
This dataset was generated as part of the Accessible Resource for Integrated Epigenomics Studies (http://www.ariesepigenomics.org.uk/).
6.2.2. Methodology
Preprocessing and quality control for this dataset were conducted using Meffil.
Associated publications:
Associated R packages:
- aries: https://github.com/MRCIEU/aries is associated with loading and using this dataset.
- meffil: https://github.com/perishky/meffil/ was used for QC and normalisations within
6.2.3. Freeze Docs
# This yaml file is a description of a freeze of a released version of a named alspac dataset # It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema id: alspacdcs:dnam_epic450_g0_g1_2022-7-13_f3 name: >- DNA methylation - EPIC & 450k - G0 + G1 version 2022-7-13 Freeze 3 description: >- This is the freeze 3 version of dnam_epic450_g0_g1, which was first introduced in freeze 2 and first released 2022-7-13. freeze_size: 137G linker_file_md5sum: b528acad88cd1697129a7cd59aa14ada woc_file_md5sum: cf9249c306e766a8689f78197e1f5f25 all_individuals_to_exclude_md5sum: 7faad74aeebaba4ed71aac783414d75b git_tag: https://github.com/alspac/dataset_dnam_epic450_g0_g1/releases/tag/Freeze3 is_current_freeze: true freeze_number: 3 freeze_date: 2023-09-13 ### Update to align with date of release previous_freeze: 2 freeze_of_alspac_dataset_version: alspacdcs:dnam_epic450_g0_g1_2022-7-13 freeze_of_named_alspac_dataset: alspacdcs:dnam_epic450_g0_g1 has_containers: - id: alspacdcs:fb5112cd-dab1-4616-be19-507aedb071cb name: data description: A dir/folder containing the data files - id: alspacdcs:300c72c2-a2aa-44c1-b8a3-ef89140e65ce name: betas description: A dir/folder containing the beta files belongs_to_container: alspacdcs:fb5112cd-dab1-4616-be19-507aedb071cb - id: alspacdcs:15cf4ce4-d704-4d08-85f1-4d6d0fd02792 name: control_matrix description: A dir/folder containing the control matrix files belongs_to_container: alspacdcs:fb5112cd-dab1-4616-be19-507aedb071cb - id: alspacdcs:d20f9a3c-b6e4-4b41-a302-621192b12124 name: derived description: A dir/folder containing the derived data (e.g. 
Cell count predictions and dnamage) belongs_to_container: alspacdcs:fb5112cd-dab1-4616-be19-507aedb071cb - id: alspacdcs:05b2cd5c-e6ab-455d-a66e-153571d40a4f name: cellcounts description: A dir/folder containing the cell count predictions belongs_to_container: alspacdcs:d20f9a3c-b6e4-4b41-a302-621192b12124 - id: alspacdcs:37ad6257-6c83-4467-a299-98268099f09a name: detection_p_values description: A dir/folder containing the matrix of detection values belongs_to_container: alspacdcs:fb5112cd-dab1-4616-be19-507aedb071cb - id: alspacdcs:5ba9bdfa-dd6b-40fd-aeaf-a43b55186c56 name: samplesheet description: A dir/folder containing matrices of the sample identification. belongs_to_container: alspacdcs:fb5112cd-dab1-4616-be19-507aedb071cb has_parts: - id: alspacdcs:8bc24e6e-7577-43dd-91a8-a298822f568c name: betas description: >- Normalized betas using functional normalization. We used 10 PCs on the controlmatrix to regress out technical variation. Slide was regressed out as random effect before normaliziation. CpGs are in rows and samples in columns. data_distributions: - id: alspacdcs:bff01858-14f6-4d31-be4a-cc5b0da20327 name: 450.gds description: >- R data object for the Normalized beta data for the 450 array only. md5sum: 02e9b3cdda39d3476bfce111f5935f93 filesize: 22G filetype: .gds belongs_to_container: alspacdcs:300c72c2-a2aa-44c1-b8a3-ef89140e65ce number_of_participants: 5927 - id: alspacdcs:598ac47f-876d-472a-b6b5-c35bc8101a5f name: common.gds description: >- R data object for the Normalized beta data for both the EPIC and 450 arrays. md5sum: 26a3ccd7c99f8074522295d649f277bf filesize: 30G filetype: .gds belongs_to_container: alspacdcs:300c72c2-a2aa-44c1-b8a3-ef89140e65ce number_of_participants: 8670 - id: alspacdcs:955e38f2-bf05-42e9-a15a-0b3091bb5066 name: epic.gds description: >- R data object for the Normalized beta data for the EPIC array only. 
md5sum: 2433412ede73c7bb85eee51763c6797b filesize: 18G filetype: .gds belongs_to_container: alspacdcs:300c72c2-a2aa-44c1-b8a3-ef89140e65ce number_of_participants: 2743 - id: alspacdcs:7306504c-e329-43c6-a65b-428c8e3fd6bf name: control_matrix description: >- The 850 control probes are summarized in 42 control types. These probes can roughly be divided into negative control probes (613), probes intended for between array normalization (186) and the remainder (49), which are designed for quality control, including assessing the bisulfite conversion rate. None of these probes are designed to measure a biological signal. The summarized control probes can be used as surrogates for unwanted variation and are used for the functional normalization. Samples are rows and 42 control types are in columns. data_distributions: - id: alspacdcs:1e616a92-34b9-4607-898d-062d5bd735ca name: 450.txt description: >- Plain text file of the control matrix for the 450 array only. md5sum: 9e6aa62498c5bb7493f7512e274056ba filesize: 2.1M filetype: .txt belongs_to_container: alspacdcs:15cf4ce4-d704-4d08-85f1-4d6d0fd02792 number_of_participants: - id: alspacdcs:3506af8d-8d94-4a2d-936e-9bcf306027ca name: common.txt description: >- Plain text file of the control matrix for both the EPIC and 450 arrays. md5sum: f3d58c03dafcd4fc10292fc1338d34f7 filesize: 3.2M filetype: .txt belongs_to_container: alspacdcs:15cf4ce4-d704-4d08-85f1-4d6d0fd02792 number_of_participants: - id: alspacdcs:5877e6be-2c2b-4dc8-b534-3989b175bee0 name: epic.txt description: >- Plain text file of the control matrix for the EPIC array only. md5sum: ea20f22f63bf9855a0e159945cbc10e3 filesize: 1010K filetype: .txt belongs_to_container: alspacdcs:15cf4ce4-d704-4d08-85f1-4d6d0fd02792 number_of_participants: - id: alspacdcs:b73ed28a-7219-49b5-94b8-e39d2bbda6f2 name: DNA methylation age description: >- DNA methylation aging estimates from within the dataset. 
Further information on this data and its usage is found within the `dnamage.html` and `dnamage.md` within the docs dir/folder. data_distributions: - id: alspacdcs:a3963582-7e06-433b-8694-77821324cc5d name: dnamage.csv description: >- A csv file containing DNA methylation aging estimates within the dataset. md5sum: 668bfe2a3c713801eb1b02e920eab964 filesize: 12M filetype: .csv belongs_to_container: alspacdcs:d20f9a3c-b6e4-4b41-a302-621192b12124 number_of_participants: 8192 - id: alspacdcs:453b4986-d992-4525-98a9-051b55a9296e name: cell counts description: >- Files contain cell counts estimated using a variety of cell type references using the Houseman deconvolution algorithm (PMID: 22568884). In each file, samples correspond to rows and cell types to columns. data_distributions: - id: alspacdcs:c39f3a33-8b19-414f-a8bc-d193847314d7 name: andrews and bakulski cord blood.txt description: >- Cord blood cell count estimates derived using the Bakulski et al. 2016 reference (PMID 27019159; https://bioconductor.org/packages/release/data/experiment/html/FlowSorted.CordBlood.450k.html). This reference has been implemented in meffil. Cell counts estimated for b-cells, cd4+ t cells, cd8+ t cells, granulocytes, monocytes, natural killer cells and nucleated red blood cells. In this text file, samples are in rows and cell types in columns. md5sum: 33c69aa8e50deb28355dcb82d01c7510 filesize: 114K filetype: .txt belongs_to_container: alspacdcs:05b2cd5c-e6ab-455d-a66e-153571d40a4f number_of_participants: 914 - id: alspacdcs:ae1ac6fc-baf0-4098-aa0a-76f499119645 name: gervin and lyle cord blood.txt description: >- Cord blood cell count estimates derived using the Gervin et al. 2019 reference (PMID 31455416; GEO accession GSE127824). Cell counts estimated for b-cells, cd4+ t cells, cd8+ t cells, granulocytes, monocytes, and natural killer cells. This reference has been implemented in meffil. In this text file, samples are in rows and cell types in columns. 
md5sum: 099c4cf9bd4ecfee91c19c3c2d2b6f70 filesize: 100K filetype: .txt belongs_to_container: alspacdcs:05b2cd5c-e6ab-455d-a66e-153571d40a4f number_of_participants: 914 - id: alspacdcs:d0d978ae-82b9-4820-89b0-698a325e208a name: cord blood gse68456.txt description: >- Cord blood cell count estimates derived using the de Goede et al. 2015 reference (PMID 26366232; GEO accession GSE68456). Cell counts estimated for b-cells, cd4+ t cells, cd8+ t cells, granulocytes, monocytes, natural killer cells and nucleated red blood cells. This reference has been implemented in meffil. In this text file, samples are in rows and cell types in columns. md5sum: 941f8a9ce1289ab5baaf10fb29bd8941 filesize: 130K filetype: .txt belongs_to_container: alspacdcs:05b2cd5c-e6ab-455d-a66e-153571d40a4f number_of_participants: 914 - id: alspacdcs:1026dca3-8263-443f-9ae5-f07401987e2f name: blood gse35069 complete.txt description: >- Cell counts in peripheral blood predicted using the peripheral blood reference published in Reinius et al. 2012 (PMID: 22848472). Same as 'blood gse35069.txt' but replaces granulocytes with eosinophils and neutrophils. This reference has been implemented in meffil. In this text file, samples are in rows and cell types in columns. md5sum: c8c1b071dfe501f54d59bd12757a2de0 filesize: 1.2M filetype: .txt belongs_to_container: alspacdcs:05b2cd5c-e6ab-455d-a66e-153571d40a4f number_of_participants: 8671 - id: alspacdcs:4fa62702-1eaa-4485-94dc-87c3d84c439b name: blood gse35069.txt description: >- Blood cell count estimates derived using the Reinius et al. 2012 reference (PMID 25424692; GEO accession GSE35069). Cell counts estimated for b-cells, cd4+ t cells, cd8+ t cells, granulocytes, monocytes, and natural killer cells. In this text file, samples are in rows and cell types in columns. 
md5sum: ba915880238cdec1f71b681e4b756d02 filesize: 1021K filetype: .txt belongs_to_container: alspacdcs:05b2cd5c-e6ab-455d-a66e-153571d40a4f number_of_participants: 8671 - id: alspacdcs:588856a5-0614-4475-a2ae-f0b22f51326c name: blood idoloptimized epic.txt description: >- Cell counts in peripheral blood predicted using the cell type reference from Bioconductor package FlowSorted.Blood.EPIC. This reference has been implemented in meffil. In this text file, samples are in rows and cell types in columns. md5sum: 0919908374891d1eb61c1cb4e12d8679 filesize: 347K filetype: .txt belongs_to_container: alspacdcs:05b2cd5c-e6ab-455d-a66e-153571d40a4f number_of_participants: 2744 - id: alspacdcs:5265d49a-b7b2-4907-9c4b-d8b71a7ba516 name: blood idoloptimized.txt description: >- Cell counts in peripheral blood predicted using the cell type reference from Bioconductor package FlowSorted.Blood.EPIC but restricted to the IDOLOptimizedCpGs450klegacy CpG sites. This reference has been implemented in meffil. In this text file, samples are in rows and cell types in columns. md5sum: 5a906d07d8c5c8ad4d85c0b436412666 filesize: 1.1M filetype: .txt belongs_to_container: alspacdcs:05b2cd5c-e6ab-455d-a66e-153571d40a4f number_of_participants: 8671 - id: alspacdcs:bd4f6a06-5c39-4817-be00-18f920ef463c name: combined cord blood.txt description: >- Cord blood cell count estimates derived using the Bakulski et al, Gervin et al., de Goede et al., and Lin et al. references (https://bioconductor.org/packages/release/data/experiment/html/FlowSorted.CordBloodCombined.450k.html) for CpG sites selected using the IDOL algorithm and optimized for the Illumina Infinium HumanMethylation450 Beadchip. Cell counts estimated for b-cells, cd4+ t cells, cd8+ t cells, granulocytes, monocytes, natural killer cells and nucleated red blood cells. In this text file, samples are in rows and cell types in columns. 
md5sum: 7cbcf72ca00012d17d22ff6d21b7575c filesize: 129K filetype: .txt belongs_to_container: alspacdcs:05b2cd5c-e6ab-455d-a66e-153571d40a4f number_of_participants: 914 - id: alspacdcs:82233c09-cafd-4128-8bdb-e00ae7d86dd5 name: detection p values description: >- This matrix shows the detection pvalues for each sample and each CpG and is extracted from the idat files using the "meffil.load.detection.pvalues" function in meffil. CpGs are in rows and samples in columns. data_distributions: - id: alspacdcs:284e4a48-0ec9-4988-bbc9-55c752e94145 name: 450.gds description: >- R object file for the detection p values matrix for the 450 array only. md5sum: 1c437226b2aab0c00aed7098e739f49d filesize: 22G filetype: .gds belongs_to_container: alspacdcs:37ad6257-6c83-4467-a299-98268099f09a number_of_participants: 5927 - id: alspacdcs:4c208bea-3511-11ee-be56-0242ac120002 name: common.gds description: >- R object file for the detection p values matrix for both EPIC and 450 arrays. md5sum: 538ffd8177ecb8adcba5095a7d5f75c0 filesize: 30G filetype: .gds belongs_to_container: alspacdcs:37ad6257-6c83-4467-a299-98268099f09a number_of_participants: 8670 - id: alspacdcs:51c397c2-3511-11ee-be56-0242ac120002 name: epic.gds description: >- R object file for the detection p values matrix for the EPIC array only. md5sum: 542967aac7f77f2c2c8208df37283f29 filesize: 18G filetype: .gds belongs_to_container: alspacdcs:37ad6257-6c83-4467-a299-98268099f09a number_of_participants: 2743 - id: alspacdcs:19ee6082-058b-446f-b424-76e75b641766 name: samplesheet description: >- Manifest files with columns extracted directly from LIMS and age, sex, omics ID, timepoint, timecode, sampletype, genotype columns to report sample mismatches, duplicate.rm column to remove duplicates. Samples in rows, variables in columns. data_distributions: - id: alspacdcs:fc5413e9-cc79-4cbd-8ca9-ee722ee18c0d name: samplesheet-450.csv description: >- R data object manifest file for the 450 array only. 
md5sum: 9410525be519472de354134626192864 filesize: 2.2M filetype: .csv belongs_to_container: alspacdcs:5ba9bdfa-dd6b-40fd-aeaf-a43b55186c56 number_of_participants: 5927 - id: alspacdcs:a5f905c8-6c47-4168-92f5-1bd9bfa59a6e name: samplesheet-common.csv description: >- R data object manifest file for both the EPIC and 450 arrays. This is a duplicate with samplesheet.csv. md5sum: 164b6418a8c6a0f5dbea3da5255b1b96 filesize: 3.2M filetype: .csv belongs_to_container: alspacdcs:5ba9bdfa-dd6b-40fd-aeaf-a43b55186c56 number_of_participants: 8670 - id: alspacdcs:376cb854-a0ee-4735-bcac-108795a3b9c3 name: samplesheet-epic.csv description: >- R data object manifest file for the EPIC array only. md5sum: f5a8ba932af085c0a9e1603afc94d23d filesize: 1.1M filetype: .csv belongs_to_container: alspacdcs:5ba9bdfa-dd6b-40fd-aeaf-a43b55186c56 number_of_participants: 2743 - id: alspacdcs:38c5b752-447c-4270-874e-fbf60825a0bc name: samplesheet.csv description: >- R data object manifest file for both the EPIC and 450 arrays. This is a duplicate with samplesheet-common.csv. md5sum: 164b6418a8c6a0f5dbea3da5255b1b96 # should be the same as samplesheet-common.csv filesize: 3.2M filetype: .csv belongs_to_container: alspacdcs:5ba9bdfa-dd6b-40fd-aeaf-a43b55186c56 number_of_participants: 8670
7. Gene Expression Data
7.1. Gene expression - array - G1 (ge_ht12_g1)
7.1.1. Description
There are two different types of QC'd data available in this version, one performed by David Evans for the Bryois et al 2014 paper, and one performed by Gibran Hemani for the molgenis eQTL mapping meta analysis. A version without QC is available as well. Details on the QC'd versions can be seen below.
7.1.2. Methodology
Bryois:
- LCLs from unrelated individuals were grown under identical conditions and cells frozen in RNAlater. RNA was extracted using an RNeasy extraction kit (Qiagen) and was amplified using the Illumina TotalPrep-96 RNA Amplification kit (Ambion). Expression profiling of the samples, each with two technical replicates, was performed using Illumina Human HT-12 V3 BeadChips (Illumina Inc), which include 48,804 probes; 200 ng of total RNA was processed according to the protocol supplied by Illumina. Raw data were imported into the Illumina Beadstudio software and probes with fewer than three beads present were excluded. Log2-transformed expression signals were then normalized with quantile normalization of the replicates of each individual followed by quantile normalization across all individuals. The analysis was restricted to 23,935 probes tagging genes annotated in Ensembl. Principal component analysis was performed on 931 individuals. 62 individuals with principal component 1 or 2 greater than one standard deviation of the population were excluded from further analysis. See http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1004461 for full details.
Molgenis:
- Genetic outliers, defined as individuals that were clear outliers in the first 2 genetic principal components, were removed. Each probe was quantile normalised and then log2 transformed. The data were then adjusted for the first 4 genetic MDS components and for expression principal components (excluding those that had genetic associations), and scaled to have mean 0 and variance 1. See https://github.com/molgenis/systemsgenetics/wiki/eQTL-mapping-analysis-cookbook for full details.
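The log2 transform and the final scaling step described for the Molgenis data can be sketched as follows. This is an illustration only, not the molgenis pipeline itself: it skips quantile normalisation and covariate adjustment, the function name and expression values are hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption.

```python
from math import log2
from statistics import mean, stdev


def log2_and_scale(values):
    """Log2-transform one probe's values across individuals, then centre
    and scale them to mean 0 and (sample) variance 1."""
    logged = [log2(v) for v in values]
    centre = mean(logged)
    spread = stdev(logged)  # sample standard deviation (n - 1 denominator)
    if spread == 0:
        raise ValueError("probe has no variation; cannot scale")
    return [(v - centre) / spread for v in logged]


# Hypothetical expression values for one probe in five individuals.
probe = [120.0, 95.5, 210.3, 150.0, 101.2]
scaled = log2_and_scale(probe)
print([round(v, 3) for v in scaled])
```

Scaling each probe separately puts all probes on a common scale, so effect sizes are expressed in standard deviation units.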
7.1.3. Freeze Docs
# This yaml file is a description of a freeze of a released version of a named alspac dataset
# It should conform to the schema https://github.com/alspac/alspac-data-catalogue-schema
id: alspacdcs:ge_ht12_g1_2015-11-02_f3
name: Gene expression - array - G1 release version 2015-11-02 freeze 3
description: >-
  This is the third freeze of the 2015-11-02 version of ge_ht12_g1 dataset which has .csv
  distributions of the data rather than .Rdata files in order to be easier to use across
  different data science software and languages.
freeze_size: 2.6G
linker_file_md5sum: b528acad88cd1697129a7cd59aa14ada
woc_file_md5sum: cf9249c306e766a8689f78197e1f5f25
all_individuals_to_exclude_md5sum: 7faad74aeebaba4ed71aac783414d75b
git_tag: https://github.com/alspac/dataset_ge_ht12_g1/releases/tag/freeze3
is_current_freeze: true
freeze_number: 3
freeze_date: 2023-09-13
previous_freeze: alspacdcs:ge_ht12_g1_2015-11-02_f2
freeze_of_alspac_dataset_version: alspacdcs:ge_ht12_g1_2015-11-02
freeze_of_named_alspac_dataset: alspacdcs:ge_ht12_g1
has_parts:
  - id: alspacdcs:ge_ht12_g1_2015-11-02_bryosis_f3
    name: Bryosis data
    description: Dataset part for the Bryosis data in ge_ht12_g1 version 2015-11-02 freeze3
    data_distributions:
      - id: alspacdcs:272eb7917f3ef253f0b65f7b01d35574_bryosis.csv
        name: bryosis.csv
        description: >-
          The freeze 3 csv version of the bryosis data. IDs in columns and Illumina probe IDs
          in rows. This is the normalised data used in Bryois et al 2014. Probe IDs are mapped
          to Genes in raw.csv
        md5sum: 272eb7917f3ef253f0b65f7b01d35574
        filesize: 742M
        filetype: .csv
        number_of_participants: 947
        number_of_gene_expression_probe_values: 48630
  - id: alspacdcs:ge_ht12_g1_2015-11-02_molgenis_f3
    name: Molgenis
    description: >-
      Dataset part for the Molgenis data in ge_ht12_g1 version 2015-11-02 freeze3
    data_distributions:
      - id: alspacdcs:d7a6826fe6b4d3a0c853ec7eaa8e55e6_molgenis.csv
        name: molgenis.csv
        description: >-
          The freeze 3 csv version of the molgenis data. IDs in columns and Illumina probe IDs
          in rows. Normalised data following the molgenis pipeline, found at
          https://github.com/molgenis/systemsgenetics/wiki/eQTL-mapping-analysis-cookbook.
          Probe IDs are mapped to Genes in raw.csv
        md5sum: d7a6826fe6b4d3a0c853ec7eaa8e55e6
        filesize: 752M
        filetype: .csv
        number_of_participants: 879
        number_of_gene_expression_probe_values: 48630
  - id: alspacdcs:ge_ht12_g1_2015-11-02_raw_f3
    name: Raw
    description: Dataset part for the raw data in ge_ht12_g1 version 2015-11-02 freeze3
    data_distributions:
      - id: alspacdcs:1e559d3a25a10f1f11387325981882a8_raw.csv
        name: raw.csv
        description: >-
          The freeze 3 csv version of the raw ge data. IDs in columns and probes in rows. Two
          columns per individual, with one column for average signal and one column for average
          number of beads. Presumably this is a file generated by the Illumina Genome Studio
          software.
        md5sum: 1e559d3a25a10f1f11387325981882a8
        filesize: 1.1G
        filetype: .csv
        number_of_participants: 994 ##This is not how wide this dataframe is
        number_of_gene_expression_probe_values: 48630
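Each data distribution in a freeze description lists an `md5sum`, which can be used to check that a file was transferred intact. The following is a minimal sketch using Python's standard `hashlib` module; the function names are hypothetical, and any file path passed to them depends on where you have stored your copy of the freeze.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks so that
    multi-gigabyte files do not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_catalogue(path, expected_md5):
    """Return True if the file's MD5 equals the catalogue's md5sum value."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

For example, `matches_catalogue("bryosis.csv", "272eb7917f3ef253f0b65f7b01d35574")` should return `True` for an intact copy of the Bryosis file listed above, assuming `bryosis.csv` is in the current working directory.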
8. Omics tips
8.1. Introduction
This section is a guide to using 'Omics datasets. It explains which software to use and describes common file formats. It's a good starting point for beginners and helpful for problem-solving.
8.2. Disclaimer
Some information is copied or reworded from software documentation. Check the original documentation alongside this guide for up-to-date information. Note that some links may no longer work.
8.3. Operating systems
You can use ALSPAC data with any operating system, but Unix-based systems such as macOS, Linux, or BSD are more convenient because of the data's size and complexity. We recommend using the command line and scripting languages such as Bash, R, Python, or Perl. Many online resources are available to learn these tools. Use free/libre and open-source software where possible.
Links:
- Unix guide: https://www.osc.edu/supercomputing/unix-cmds
- Beginning Python: https://www.python.org/about/gettingstarted/
- Beginning R: https://www.statmethods.net/r-tutorial/index.html
- Free/libre and open-source software: https://www.fsf.org/about/
8.4. Key Omics software
8.4.1. Plink
Plink is a tool for performing quality control and whole genome association analysis of genetic data.
8.4.2. SNPTest
SNPTest is a tool for performing whole genome association analysis of genetic data.
- Link: https://mathgen.stats.ox.ac.uk/genetics_software/snptest/snptest.html (Not open source)
8.4.3. BoltLmm
BoltLmm is a tool for performing genome association analysis of genetic data. It is recommended for analyses of more than 5000 samples, and its methods automatically take population substructure into account.
8.4.4. Qctools
A tool for quality control of genetic data. It is also useful for inspecting and modifying .gen, .bgen and .vcf files (see section 8.5 below).
8.4.5. SAMTOOLS
Samtools is a suite of tools which are used for genomic analysis.
- Link: http://www.htslib.org/
8.4.6. VCFTOOLS
A program package for working with .vcf files. It is developed separately from samtools.
8.4.7. BCFTOOLS
This is part of the samtools project and allows users to manipulate .vcf and .bcf files.
8.5. File types
In a Unix environment the suffix (extension) of a file name has no special meaning to the operating system, unlike in Windows, which uses the extension to decide how to treat a file. In a Unix system it is simply part of the file name, and people use it to distinguish file formats. The following is a non-exhaustive list of file types you may encounter whilst using ALSPAC Omics data.
8.5.1. .gen
This is an 'Oxford' data format for genetic data. The .gen file is a plain text file, which means that standard Unix command line tools, for example 'head' or 'less', can be used to inspect the data.
The .gen (genotype) file stores data in a one-line-per-SNP format. The first 5 entries of each line are the SNP ID, the RS ID of the SNP, the base-pair position of the SNP, the allele coded A and the allele coded B. The SNP ID can be used to denote the chromosome number of each SNP. The next three numbers on the line are the probabilities of the three genotypes AA, AB and BB at the SNP for the first individual in the cohort. The next three numbers are the genotype probabilities for the second individual, the next three are for the third individual, and so on. The order of individuals in the genotype file should match the order of the individuals in the sample file (see below). The probabilities need not sum to 1, which allows for the possibility of a NULL genotype call. This format therefore allows for genotype uncertainty. It is the same format as that produced by the genotype calling algorithm CHIAMO.
NOTE: We recommend that you arrange SNPs in base-pair order in the genotype files. This is required if you want to use the files with IMPUTE and will make viewing the output of SNPTEST somewhat easier.
For example, suppose you want to create a genotype file for 2 individuals at 5 SNPs whose genotypes are:
SNP | Individual 1 | Individual 2 |
SNP 1 | AA | AA |
SNP 2 | GG | GT |
SNP 3 | CC | CT |
SNP 4 | CT | CT |
SNP 5 | AG | GG |
The correct genotype file would look like this:
SNP1 rs1 1000 A C 1 0 0 1 0 0
SNP2 rs2 2000 G T 1 0 0 0 1 0
SNP3 rs3 3000 C T 1 0 0 0 1 0
SNP4 rs4 4000 C T 0 1 0 0 1 0
SNP5 rs5 5000 A G 0 1 0 0 0 1
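The line layout described above can be sketched in Python. This is a minimal illustration rather than a replacement for tools such as qctool: it assumes a plain-text .gen file whose fields are separated by whitespace and which has exactly five leading columns, and the names GenRecord and parse_gen_line are hypothetical.

```python
from typing import NamedTuple


class GenRecord(NamedTuple):
    snp_id: str
    rs_id: str
    position: int
    allele_a: str
    allele_b: str
    probabilities: list  # one (P(AA), P(AB), P(BB)) tuple per individual


def parse_gen_line(line: str) -> GenRecord:
    """Parse one whitespace-separated line of a .gen file with five leading columns."""
    fields = line.split()
    snp_id, rs_id, position, allele_a, allele_b = fields[:5]
    values = [float(v) for v in fields[5:]]
    if len(values) % 3 != 0:
        raise ValueError("expected three probabilities per individual")
    # Group the flat list of numbers into one triple per individual.
    triples = [tuple(values[i:i + 3]) for i in range(0, len(values), 3)]
    return GenRecord(snp_id, rs_id, int(position), allele_a, allele_b, triples)


# SNP 2 from the example: individual 1 is GG (AA), individual 2 is GT (AB).
record = parse_gen_line("SNP2 rs2 2000 G T 1 0 0 0 1 0")
print(record.rs_id, record.position, record.probabilities)
```

Because the individuals are identified only by their position on the line, the index of each triple has to be matched against the row order of the accompanying .sample file.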
8.5.2. .bgen
A binary version of a .gen file. It cannot be inspected directly with text tools on the command line. .bgen files are used because they are much smaller and faster to process than plain text when storing large amounts of Omics data. The full details of the file format are given at: https://www.well.ox.ac.uk/~gav/bgen_format/
.bgen files are normally used with tools such as qctool and snptest. There is also a library for reading .bgen files into R: https://bitbucket.org/gavinband/bgen/wiki/rbgen
8.5.3. .sample
The .sample file is paired with either .gen or .bgen files. It contains information on the samples that is not genetic. It is a plain text file that can be inspected with standard Unix command line tools.
Please note that the sample file format changed with the release of SNPTEST v2. Specifically, the way in which covariates and phenotypes are coded on the second line of the header has changed. The sample file has three parts: (a) a header line detailing the names of the columns in the file, (b) a line detailing the types of variables stored in each column, and (c) a line for each individual detailing the information for that individual. Here is an example of the start of a sample file:
ID_1 ID_2 missing cov_1 cov_2 cov_3 cov_4 pheno1 bin1
0 0 0 D D C C P B
1 1 0.007 1 2 0.0019 -0.008 1.233 1
2 2 0.009 1 2 0.0022 -0.001 6.234 0
3 3 0.005 1 2 0.0025 0.0028 6.121 1
4 4 0.007 2 1 0.0017 -0.011 3.234 1
5 5 0.004 3 2 -0.012 0.0236 2.786 0
The header line: This line needs a minimum of three entries. The first three entries should always be ID_1, ID_2 and missing. They denote that the first three columns contain the first ID, second ID and missing data proportion of each individual. Additional entries on this line should be the names of covariates or phenotypes that are included in the file. In the above example, there are 4 covariates named cov_1, cov_2, cov_3 and cov_4, a continuous phenotype named pheno1 and a binary phenotype named bin1. NOTE: All phenotypes should appear after the covariates in this file.
The second line: This line details the type of variable stored in each column. The first three entries of this line should be set to 0. Subsequent entries for covariates and phenotypes should be specified by the following rules:
Code | Variable type |
D | Discrete covariate (coded using positive integers) |
C | Continuous covariate |
P | Continuous phenotype |
B | Binary phenotype (0 = Controls, 1 = Cases) |
The remainder of the file should consist of a line for each individual containing the information specified by the entries of the header line (see example above). Separate the entries of the sample file with spaces, not tabs, because a space is the expected delimiter.
Missing values: It is possible to specify missing values for covariates and phenotypes. Previously it was recommended that you use -9 for missing values. This was the default value assumed by SNPTEST v1, although the -missing_code option in SNPTEST v1 meant that you could use other numeric values for the missing code. In SNPTEST v2 the behavior of the -missing_code option has changed so that it now takes a comma-separated list of values, each of which is treated as missing when encountered in the sample file(s). Default missing values are now denoted by the two-character string "NA".
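The three-part structure of a .sample file can be checked with a short Python sketch. It assumes the SNPTEST v2 layout described above and the default "NA" missing code; the function name read_sample and the shortened example data are hypothetical. Note that str.split() accepts any whitespace, so this reader is more lenient than tools that insist on single spaces.

```python
import io

MISSING = "NA"  # default missing-value string described above for SNPTEST v2


def read_sample(handle):
    """Return (column names, column types, rows) from a .sample file."""
    names = handle.readline().split()
    types = handle.readline().split()
    if names[:3] != ["ID_1", "ID_2", "missing"]:
        raise ValueError("first three columns must be ID_1, ID_2 and missing")
    if len(types) != len(names):
        raise ValueError("type line and header line differ in length")
    rows = []
    for line in handle:
        values = line.split()
        if not values:
            continue  # skip blank lines
        if len(values) != len(names):
            raise ValueError("row has the wrong number of entries")
        # Map each column name to its value, using None for missing entries.
        rows.append({n: (None if v == MISSING else v) for n, v in zip(names, values)})
    return names, types, rows


example = io.StringIO(
    "ID_1 ID_2 missing cov_1 pheno1\n"
    "0 0 0 D P\n"
    "1 1 0.007 1 1.233\n"
    "2 2 0.009 2 NA\n"
)
names, types, rows = read_sample(example)
print(types)
print(rows[1])
```

Values are kept as strings here; converting them according to the D, C, P and B codes on the second line is left to the caller.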
8.5.4. .ped
A plink format file that is in plain text and can be viewed with standard tools. It contains genetic variant data. https://www.cog-genomics.org/plink/1.9/formats#ped
8.5.5. .map
A plink format file that is in plain text. It contains information about variants. https://www.cog-genomics.org/plink/1.9/formats#map
8.5.6. .bed
A plink format file that is a binary equivalent of a .ped file. It is smaller and faster to process but is not easily viewable or editable. https://www.cog-genomics.org/plink/1.9/formats#bed
8.5.7. .bim
A plink format file, similar to a .map file, that is used with binary .bed files. https://www.cog-genomics.org/plink/1.9/formats#bim
8.5.8. .fam
A plain text format that contains sample information for plink binary files. https://www.cog-genomics.org/plink/1.9/formats#fam
8.5.9. .csv
A plain text format where different fields are separated by commas (comma-separated values).
8.5.10. .vcf
VCF files are a flexible file format for storing different types of genetic variants. They are plain text and can be inspected on the command line with standard Unix tools. However, they are often very large, so specific tools such as 'vcftools' are useful for working with them. SNPs are the variants most commonly stored in these files, but others such as copy number variations can also be stored. The basic form of a vcf file is described at: https://en.wikipedia.org/wiki/Variant_Call_Format
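As an illustration of the plain-text layout, the following Python sketch splits VCF data lines into the eight fixed, tab-separated columns defined by the VCF specification (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO). The function names and the example record are hypothetical, and for real data a dedicated tool such as vcftools or bcftools is likely to be a better choice.

```python
VCF_FIXED_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_vcf_line(line: str) -> dict:
    """Split one tab-separated VCF data line into its eight fixed fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < len(VCF_FIXED_COLUMNS):
        raise ValueError("a VCF data line needs at least eight columns")
    record = dict(zip(VCF_FIXED_COLUMNS, fields))
    record["POS"] = int(record["POS"])
    # Anything after INFO is the FORMAT column followed by one column per sample.
    record["extra"] = fields[len(VCF_FIXED_COLUMNS):]
    return record


def iter_vcf_records(lines):
    """Yield parsed records, skipping '#'-prefixed header lines and blank lines."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        yield parse_vcf_line(line)


example_lines = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1\n",
    "1\t1000\trs1\tA\tC\t.\tPASS\t.\tGT\t0/1\n",
]
records = list(iter_vcf_records(example_lines))
print(records[0]["ID"], records[0]["POS"], records[0]["extra"])
```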
8.5.11. .bcf
This is a binary version of a vcf file. It cannot be inspected on the command line, but can be used with the genomic tools mentioned in this document.
8.5.12. .tar.gz
This is a standard Unix file format for bundling and compressing a set of files. It is similar to a .zip file. It is made by first bundling a set of files into a .tar file (sometimes called a tarball), which is then compressed using 'gzip' (GNU zip). https://en.wikipedia.org/wiki/Tar_(computing) https://en.wikipedia.org/wiki/Gzip
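The bundle-then-compress idea can be tried out with the tarfile module from the Python standard library. The sketch below builds a small archive in a temporary directory and lists its contents; the helper names and the file names are hypothetical.

```python
import os
import tarfile
import tempfile


def make_tar_gz(archive_path, file_paths):
    """Bundle the given files into a gzip-compressed tar archive."""
    with tarfile.open(archive_path, "w:gz") as archive:
        for path in file_paths:
            # Store each file under its base name rather than its full path.
            archive.add(path, arcname=os.path.basename(path))


def list_tar_gz(archive_path):
    """Return the names of the files stored in a .tar.gz archive."""
    with tarfile.open(archive_path, "r:gz") as archive:
        return archive.getnames()


with tempfile.TemporaryDirectory() as workdir:
    data_file = os.path.join(workdir, "example.sample")
    with open(data_file, "w") as handle:
        handle.write("ID_1 ID_2 missing\n0 0 0\n")
    archive_file = os.path.join(workdir, "example.tar.gz")
    make_tar_gz(archive_file, [data_file])
    names_in_archive = list_tar_gz(archive_file)

print(names_in_archive)
```

Listing the contents before extracting is a cheap way to check what an archive will write to disk.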
8.5.13. .enc
This file extension is used by convention to mean that the file is encrypted. You will need the password that was used to encrypt the data in order to decrypt the files. https://en.wikipedia.org/wiki/OpenSSL
8.6. Variant/SNP ids
There are many types of genetic variation. A common type is a single nucleotide polymorphism (SNP). Others include copy number variations.
Variants can be specified by a Chromosome and location in reference to a specific build of the human genome. They can also be given a reference SNP (rs) cluster identifier.
- Chr:Location
- Rs ids
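When variant lists from different sources are combined, it can help to check which of the two styles each identifier uses. The Python sketch below is one possible way to do this; the regular expressions and the function name classify_variant_id are assumptions for illustration, and real files may use other forms (for example with alleles appended) that this sketch reports as unknown.

```python
import re

# Assumed patterns: "rs" followed by digits, or chromosome:position
# with an optional "chr" prefix. Other conventions exist.
RS_PATTERN = re.compile(r"^rs(\d+)$")
CHR_POS_PATTERN = re.compile(r"^(?:chr)?(\d{1,2}|X|Y|MT):(\d+)$")


def classify_variant_id(variant_id: str) -> dict:
    """Classify an identifier as an rs id, a chr:position id, or unknown."""
    match = RS_PATTERN.match(variant_id)
    if match:
        return {"kind": "rs", "rs_number": int(match.group(1))}
    match = CHR_POS_PATTERN.match(variant_id)
    if match:
        return {
            "kind": "chr_pos",
            "chromosome": match.group(1),
            "position": int(match.group(2)),
        }
    return {"kind": "unknown"}


for example_id in ["rs12345", "2:5000", "chrX:150"]:
    print(example_id, classify_variant_id(example_id))
```

A chr:position identifier is only meaningful together with the genome build it refers to, so the build should be recorded alongside any such list.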
8.7. Overview of Imputation reference panels
SNP array data frequently contain hundreds of thousands of variants. However, due to linkage disequilibrium, it is possible to estimate many more SNP values for an individual. This estimation procedure is called imputation, and it works by combining an individual's SNP array data with a large reference population of sequenced data. In this way it is possible to have accurate estimates of millions of SNP values for an individual without the cost of fully sequencing each person. ALSPAC has prerun the imputation process using three different imputation panels.
8.7.1. Panels
- TOPMed
An upcoming (to ALSPAC) reference panel, which will have the most SNPs.
- HRC
This is the latest reference panel, and our data contain approximately 40 million SNPs.
- 1000 Genomes
This is the previous generation reference panel which is still widely used in ALSPAC studies. There are some SNPs that appear in this panel that are not in the HRC panel.
- HapMap
This was the first widely used imputation panel.
8.8. SNP data types from imputation.
SNPs that have been imputed can be stored and analysed in different formats. These can be appropriate for different types of analysis; for example, an analysis could assume an additive effect for the minor allele or it could assume a recessive/dominant effect.
- Best guess. The data are presented as 0, 1 or 2 to represent how many copies of the minor allele a person has at that position. The best guess is derived from the genotype probabilities calculated by the imputation process.
- Dosage. The imputation process gives the probability that the person has 0, 1 or 2 copies of the minor allele, e.g. 0.1, 0.2, 0.7. These probabilities sum to one across the three possibilities (i.e. for each SNP for each individual). The dosage is the expected number of copies of the minor allele, here 0.2 + 2 × 0.7 = 1.6.
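The relationship between the three genotype probabilities and the two derived data types can be sketched in Python. The expected allele count below is what is commonly called the dosage; the 0.9 calling threshold for the best guess is an assumption chosen for illustration, and different tools use different thresholds. The function names are hypothetical.

```python
def expected_dosage(p_0, p_1, p_2):
    """Expected number of copies of the counted allele (between 0 and 2)."""
    return p_1 + 2.0 * p_2


def best_guess(p_0, p_1, p_2, threshold=0.9):
    """Return 0, 1 or 2 copies, or None if no genotype reaches the threshold.

    The threshold value is an illustrative assumption.
    """
    probabilities = (p_0, p_1, p_2)
    # Pick the allele count with the largest probability.
    count = max(range(3), key=lambda i: probabilities[i])
    return count if probabilities[count] >= threshold else None


# Probabilities of carrying 0, 1 and 2 copies, as in the example above.
probs = (0.1, 0.2, 0.7)
print(expected_dosage(*probs))
print(best_guess(*probs))                 # no genotype reaches 0.9
print(best_guess(*probs, threshold=0.5))  # most likely genotype is accepted
```

The dosage keeps the imputation uncertainty, whereas a best guess discards it, which is why a best-guess call may be left missing when no genotype is sufficiently probable.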
8.9. SNP Statistics
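You can generate statistics on your SNP data using the program 'qctool'. The per-SNP summary (-snp-stats) includes the imputation information scores. For example:
qctool -g example.bgen -s example.sample -snp-stats -osnp snp-stats.txt
The equivalent per-sample summary is produced with:
qctool -g example.bgen -s example.sample -sample-stats -osample sample-stats.txt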
8.10. Best practice
8.10.1. GWAS
We recommend you follow the steps outlined in the following paper when performing GWAS: Marees, Andries T., et al. "A tutorial on conducting genome‐wide association studies: Quality control and statistical analysis." International journal of methods in psychiatric research 27.2 (2018): e1608. https://doi.org/10.1002/mpr.1608
8.10.2. PheWAS
We recommend you follow the steps outlined in the following paper when performing PheWAS: Millard, L., Davies, N., Timpson, N. et al. MR-PheWAS: hypothesis prioritization among potential causal effects of body mass index on many outcomes, using Mendelian randomization. Sci Rep 5, 16645 (2015). https://doi.org/10.1038/srep16645
8.10.3. Methylation
The following paper describes the methylation data available in ALSPAC: Relton, Caroline L., et al. "Data resource profile: accessible resource for integrated epigenomic studies (ARIES)." International journal of epidemiology 44.4 (2015): 1181-1190.
8.11. Population stratification
Population stratification occurs when an observed genetic association is due to systematic differences in ancestry between groups (population structure or geography) rather than to a genuine effect of the variant. Not taking this into account can lead to biased estimates of effects. One common method to account for it is to calculate principal components of the genetic data and then to include these as covariates in any models. Principal components can be generated using plink or other tools.
For more information about how to do this in plink see: https://www.cog-genomics.org/plink/1.9/strat
Another common method used to account for population substructure is to use linear mixed models, for example with the BOLT-LMM software tool.
8.12. Common tasks
Here we provide links to webpages with instructions, or brief example code, for completing common tasks using the software described above (section 8.4):
- Extract some SNPs from a bgen data file and convert to plain text.
https://www.well.ox.ac.uk/~gav/qctool_v2/documentation/examples/filtering_variants.html
- Extract some SNPs from bed data:
http://zzz.bwh.harvard.edu/plink/dataman.shtml
plink --bfile mydata --chr 2 --from-kb 5000 --to-kb 10000
- Reading .bgen and .sample oxford files in plink
Plink supports bgen files, but it is strict about the column types declared in the data.sample file. You may need to remove or retype columns before plink will read a data.sample file. For more information see:
https://www.cog-genomics.org/plink/2.0/input
To make a new sample file removing some columns you can use the Unix command: 'cut -f 1,2,3 -d " " data.sample > data2.sample'
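The same column trimming can be done by column name rather than by position, which is convenient when the columns of interest are not the first three. The following Python sketch assumes a space-separated .sample file with a header line and a type line; the function name keep_sample_columns and the example data are hypothetical.

```python
import io


def keep_sample_columns(source, destination, keep=("ID_1", "ID_2", "missing")):
    """Copy a .sample file, keeping only the named columns on every line."""
    header = source.readline().split()
    # Look up the position of each requested column in the header line.
    indices = [header.index(name) for name in keep]
    destination.write(" ".join(keep) + "\n")
    # The type line and the per-individual lines are trimmed in the same way.
    for line in source:
        values = line.split()
        if values:
            destination.write(" ".join(values[i] for i in indices) + "\n")


source = io.StringIO(
    "ID_1 ID_2 missing cov_1\n"
    "0 0 0 D\n"
    "1 1 0.007 1\n"
)
trimmed = io.StringIO()
keep_sample_columns(source, trimmed)
print(trimmed.getvalue())
```

With real files, the two StringIO objects would be replaced by files opened for reading and writing, for example data.sample and data2.sample.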
8.13. Courses
Working with Omics data can be complicated, but there are many excellent resources available to help you learn how to do this. There are both paid in-person courses and free online courses.
Details on paid courses offered by Bristol University can be found here: https://www.bristol.ac.uk/medical-school/study/short-courses/ In addition, a number of free online courses are summarised here: https://www.mooc-list.com/tags/bioinformatics
8.14. Further sources of help
8.14.1. Stack exchange
Stack Exchange is an online Q&A community which is divided into different sub-communities. The first and most well-known is Stack Overflow, which is one of the best places on the Internet to ask questions about programming. Other useful Stack Exchange sites include bioinformatics https://bioinformatics.stackexchange.com/, maths https://mathoverflow.net/ and statistics https://stats.stackexchange.com/.
8.14.2. Bio-stars
Biostars is a bioinformatics community Q&A website: https://www.biostars.org/
8.14.3. Mailing lists
For individual products or projects there is often a mailing list. For example, to get help using SNPTEST you can ask on the mailing list: https://mathgen.stats.ox.ac.uk/genetics_software/snptest/snptest.html#contact
8.14.4. AI tools
AI tools such as ChatGPT can be useful for understanding how to work with omics data.
8.14.5. Ask ALSPAC
If you cannot find the answer to your question, or you think there is something wrong with your data, then please contact the alspac-omics@bristol.ac.uk mailbox and we will do our best to help you.