subset to the original allele alignment to produce a hit table of the
expected probe pattern for all possible genotypes (again with allge-
nos_hit_table_2.R). The number of exact matches to each probe
sequence or its reverse complement in the forward and reverse files
(read1.fastq and read2.fastq, respectively) for each sample was
then counted. Aggregate counts above a threshold of 10 were
considered positive, and the results were compared with the geno-
type hit table with a bespoke R script (KFFsums.R) for assigning the
KIR allele genotype.
KIR Sequence-Alignment Genotype by SOS. Data used for gener-
ating filters and reference sequences were obtained from the
IPD-KIR database (release 2.6.1, 17 February 2015) and the set of
29 complete KIR haplotype sequences.
6,14,15
For each given KIR
gene, all available allele sequences were selected for use as a posi-
tive filter, and all allele sequences of all other KIR genes were used
as a negative filter. We selected sequence reads specific to the given
KIR by mapping them to the positive filter with Bowtie 2.
50
We
performed this mapping non-stringently (
R97% nucleotides
matched) to allow for the detection of unknown SNPs. All reads
that aligned to the positive filter were retained, and those that
did not were excluded. The retained reads were mapped strin-
gently (
R99% match) to the negative filter, and in this case, those
that aligned were excluded. Finally, the selected reads were
aligned to a single reference sequence that was chosen for each
KIR gene (
Figure S3
), and the SNP variants were ascertained with
SAMtools/BCFtools version 1.2.
35,50
We analyzed the resulting variant call files (VCFs) with a custom
R script (jSOS) to generate genotypes based on the known combi-
nations of SNPs (i.e., the unique KIR alleles available from IPD). A
post-filtered and aligned read depth of 20 was used as the mini-
mum for calling the genotype at any given nucleotide position.
The jSOS algorithm determines all of the possible allele combina-
tions on the basis of the genotypes of those SNPs that achieve
the threshold read depth. Thus, if a specific SNP fails to reach
the threshold, the ambiguity of the final allele call will increase,
but an incorrect allele-level genotype will not be returned.
When the combination of SNPs does not correspond to a known
pair of alleles, jSOS identifies that a novel combination of known
SNPs is present. In these situations, the known allele most likely
present is identified in addition to the new combination (we
view this as the most parsimonious genotype, and the least parsi-
monious could be two novel combinations present). The jSOS al-
gorithm also identifies novel SNPs. When manual confirmation
was sought, such as these cases of novel polymorphism, align-
ments for visual inspection were created with MIRA 4.0.2 and
examined with Gap4 of the Staden Package.
56
HLA Class I Genotypes
The HLA class I allele compositions were determined with
NGSengine 1.7.0 (GenDX software), kindly provided by Wietse
Mulder and Erik Rozemuller, with the ‘‘IMGT 3.18.0 combined’’
reference set. There was no pre-filtering for HLA genes, and the
data were analyzed directly with the software. The one exception
was the removal of reads mapping to HLA-Y, an HLA class I pseu-
dogene present in a subset of HLA haplotypes.
57
The presence or
absence of HLA-Y was determined via sequence-specific string
searches of the FASTQ data. For the IHWG cells, these assignments
agreed with those previously determined by PCR.
43
We further
validated the assignment of HLA class I alleles by analyzing the
sequence data with Assign MPS 1.0 (Conexio Genomics), kindly
provided by Damian Goodridge. We resolved any discrepancies
between the two methods, or with previously obtained results,
by designing virtual probes that distinguish the two possibilities
in question and searching the unprocessed sequence-read data
(as described for KFF). Then, sequence reads specific to the locus
under question were extracted and aligned to reference sequences
(as described for SOS) and inspected manually. Here, the reference
sequences used were chosen on the basis of the HLA class I geno-
type of the respective sample. The majority of HLA class I alleles
are not fully characterized through all exons and introns
(
>95%).
15
Thus, for some of the validations, the second or third
fields of resolution were compared.
Haplotype Assignments
Haplotypes derived from homozygous cell lines were unambigu-
ous. Haplotypes were assigned for the family trios by segregation.
For all other individuals, centromeric and telomeric allele-level
KIR haplotypes were assigned for each individual according to
the expectation-maximization algorithm of haplo.stats imple-
mented in the R programming language (see
Web Resources
).
Results
Validation of the KIR Capture Method
We applied our capture/NGS method to DNA extracted
from 97 publically available cell lines (sample set 1,
Mate-
rial and Methods
), originally collected by the IHWG to
facilitate study of HLA genes.
40
One of these cell lines
is PGF, whose KIR haplotypes have been characterized
previously by standard methods.
6,14
We therefore used
the data we obtained from PGF cells to assess the quality
of the KIR sequences produced. The PGF KIR sequence
reads were mapped back onto the two conventionally
determined haplotype sequences. For PGF KIR haplotype
1 (137,813 bp; GenBank: FP089703), our method gave
100% coverage, a mean read depth of 49.4
3, and a read
depth of
>103 for 99.5% of the haplotype (
Figures 3
A
and 3B). For PGF KIR haplotype 2 (142,732 bp; GenBank:
FP089704), we obtained 99.99% coverage, a mean read
depth of 49.8
3, and a read depth >103 for 99.5% of the
haplotype. Our haplotype 2 sequence contains two short
gaps, comprising 18 nucleotides in total (
Figure 3
C).
The missing sequences are not predicted to be of func-
tional importance (
Figure 3
C). Thus, our method gives
full coverage of the KIR region when it targets haplotypes
that were included in the probe design. However, for pop-
ulation studies, it is important that our probe sets can cope
with a wider (and unknown) range of diversity without
any ‘‘allelic dropout.’’ As a test, we applied our method to
genomic DNA from the subject of the chimpanzee genome
project.
44
We mapped the obtained reads to the two
full KIR haplotype sequences (GenBank: BX842589 and
AC155174) that were previously characterized from this
individual.
25,58
These sequences represent both haplo-
types of the KIR region and include 10 of the 14 chim-
panzee KIR genes. We obtained 98.8% coverage for both
of the haplotypes and mean read depths of 130
3 for haplo-
type 1 and 110
3 for haplotype 2. This success in capturing
the chimpanzee KIR region strongly indicates that our
method captures the full range of human KIR haplotypes.
380
The American Journal of Human Genetics 99, 375–391, August 4, 2016