Quality control to improve properties of sequence genotypes from different sources

D.J. Null*, J.B. Cole, A. Al-Khudhair, and P.M. VanRaden

Animal Genomics and Improvement Laboratory, ARS, USDA, Beltsville, MD


2020 J. Dairy Sci. (?)
© American Dairy Science Association, 2020. All rights reserved.

ABSTRACT

Sequence genotypes from run7 of the 1000 Bull Genomes Project, high-density array genotypes for many of the same bulls, and additional sequence data were examined to determine optimal editing strategies. The 3,093 sequenced animals in the run7 Bos taurus analysis included 928 Holsteins, 175 Brown Swiss, 156 Ayrshires/Red Dairy Cattle, 105 Jerseys, 51 Montbeliardes, 22 Normandes, and 20 Guernseys; 1,429 were selected as animals of interest after editing or removing bulls with low coverage; incorrect identification, breed, or pedigree; duplicate sequence genotypes; or sequence genotypes that were inconsistent with chip genotypes. An additional 241 bulls had sequence variants identified locally by SAMtools rather than globally by GATK, which is now used in run7. Using chromosome 29 as an example, the raw global analysis identified 149,684 variants, and the local data identified 99,600; surprisingly, the overlap was only 48,266 variants. Thus, about half of the variants in the local data were absent from the global data, which had been expected to be a superset. Known lethal recessive alleles affecting fertility were present and retained. For quality control, array genotypes from the Council on Dairy Cattle Breeding (Bowie, MD) database included either 79,294 SNP from routine predictions or 643,059 SNP from imputed high-density genotypes. Sequence genotypes for 534 of the run7 animals had matching array genotypes from national data. Concordance of genotypes was higher for the run7 raw data (98.6% for 69,433 matching SNP) than for the run7 Beagle-imputed subset (98.0% for 61,299 SNP). After excluding multiallelic variants, which were 9% of the run7 raw variants, 48,056,551 variants were polymorphic in the 1,429 dairy animals and included 11% insertions and 4% deletions. Genotypes were then edited for missing rate, parent-progeny conflicts, and excess heterozygotes, and variants were required to have a minor allele frequency of >1% in at least 1 breed. After removing loci in a few potentially mismapped regions of the ARS-UCD1 reference map, an edited total of 6,735,530 loci were available to impute genotypes for other animals and investigate phenotypic effects.
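
The overlap and concordance figures above can be checked with simple arithmetic. The Python sketch below is illustrative only and is not the pipeline used in this study: the function names, the 0/1/2 genotype coding, and the rule of skipping missing calls are assumptions, and the genotype vectors in the usage lines are toy data rather than results.

```python
def overlap_summary(n_global, n_local, n_shared):
    """Summarize the overlap of two variant call sets from their counts."""
    if n_shared > min(n_global, n_local):
        raise ValueError("shared count cannot exceed either set size")
    return {
        "local_only": n_local - n_shared,
        "global_only": n_global - n_shared,
        "union": n_global + n_local - n_shared,
        "frac_local_missing_from_global": (n_local - n_shared) / n_local,
        "frac_global_missing_from_local": (n_global - n_shared) / n_global,
    }


def concordance(seq_calls, array_calls):
    """Fraction of loci with identical genotype codes.

    Assumes genotypes are coded 0/1/2 (allele counts) and that a missing
    call is None; loci missing in either source are skipped.
    """
    compared = [
        (s, a)
        for s, a in zip(seq_calls, array_calls)
        if s is not None and a is not None
    ]
    if not compared:
        raise ValueError("no loci with calls in both sources")
    return sum(s == a for s, a in compared) / len(compared)


# Chromosome 29 counts as reported in the abstract.
chr29 = overlap_summary(n_global=149_684, n_local=99_600, n_shared=48_266)
print(chr29["local_only"], round(chr29["frac_local_missing_from_global"], 3))

# Toy genotype vectors (hypothetical, for illustration only).
toy_seq = [0, 1, 2, None, 1]
toy_array = [0, 1, 1, 2, 1]
print(concordance(toy_seq, toy_array))
```

With the reported chromosome 29 counts, 51,334 of the 99,600 local variants (about 52%) are absent from the global set, which is the basis for the statement that about half of the local variants were not found globally.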

Keywords: variant calling, genotype concordance, sequence variant