the Burrows-Wheeler Aligner (BWA-MEM
49
; k
¼ 16) and counting
the number of reads that mapped within the target coordinates
of hg19.
Validating the Results from Capture/NGS via Established Methods
Previously described protocols were used for pyrosequencing
16
and real-time PCR.
12
Standard Sanger DNA sequencing reactions
were performed in forward and reverse directions with the BigDye
Terminator v.3.1 and were analyzed with a 3730 DNA Analyzer
(Applied Biosystems). To verify the sequences of previously
uncharacterized alleles, we cloned PCR products by using the
pCR2.1-TOPO vector (Invitrogen) and sequenced them with
M13 and internal primers. For each individual in whom an allele
was identified, five or more clones corresponding to the allele
were sequenced. HLA class I genotyping was performed with
Luminex-bead-based SSOP hybridization, as described previ-
ously
16
except where indicated otherwise.
Bioinformatics Methods: PING Pipeline
Harvesting KIR-Specific Reads
For sequence data obtained by the capture/NGS method, sequence
reads specific to the KIR region were identified and harvested with
Bowtie 2.
50
We concatenated the 29 sequenced KIR haplotypes
and all KIR alleles from the IPD-KIR database
6,14,15
to create a sin-
gle reference file for this purpose. The equivalent of 70,000 reads
(at 2
3 300 bp) that passed this filter stage were taken per sample
for KIR genotyping.
KIR-Gene-Content Module: PING_gc
This module is divided into two complementary components: the
first (KFFgc) is based on string searches of raw data, and the second
(MIRAgc) is based on read depth following alignment to reference
sequences.
Determining KIR Gene Content with Virtual Probes: KFFgc. The
sequence-read files specific to the KIR region were first processed
with the ‘‘FASTQ Quality Trimmer’’ function of the FASTX-Toolkit
v.0.0.13 (see
Web Resources
). From an alignment of all human KIR
sequences (IPD-KIR release 2.6.1, February 17, 2015),
15
all possible
32 bp sequences specific to just one of the KIR genes were gener-
ated, and those sequences covering polymorphic positions were
removed. For each KIR gene, this process produced a set of gene-
specific but not allele-specific probes. For each gene, ten such
probes were selected at random and used for determining KIR
gene content. We then counted the total number of exact matches
to each probe sequence and its reverse complement in the forward
and reverse (read1.fastq and read2.fastq, respectively) sequence-
read files for each sample. Because every KIR haplotype has one
copy of KIR3DL3,
6,14,16
the presence or absence of other KIR genes
was determined from the mean number of probe hits per target
KIR divided by the mean number of KIR3DL3 probe hits on
KIR3DL3. We used a ratio of 0.2 as the threshold, and values above
this were considered positive.
Determining KIR Gene Copy Number by Analyzing Read Depth:
MIRAgc. As shown previously, the depth of reads aligned to a refer-
ence can be used for estimating structural variation on a genomic
scale.
51,52
We incorporated this concept into the KIR genotyping
pipeline. Using MIRA 4.0.2,
34
we aligned the KIR-region-specific
sequence-read
pairs
simultaneously
against
one
reference
sequence for each KIR gene. Using values extracted from the ‘‘con-
tigstats’’ results table from the MIRA output, we divided the num-
ber of reads that aligned to each KIR gene by the number of reads
aligning to KIR3DL3. This comparison produced a clustering of
read ratios correlating with differences in KIR gene content
(
Figure 2
and
Figure S2
). For example, for KIR2DL4, which is ab-
sent from some haplotypes and duplicated on others,
13,53
we
observed three clusters of read ratios, corresponding to one, two,
or three copies of KIR2DL4 (
Figure 2
). For KIR2DL5, which can
be present at centromeric and telomeric locations in the KIR
haplotype,
6,54
we observed five clusters of read ratios, correspond-
ing to individuals with zero, one, two, three, or four copies of
KIR2DL5 (
Figure 2
). Representative examples of clustering results
derived for all the KIR are given in
Figure S2
.
In very rare cases, there can be copy-number variation of
KIR3DL3. For example, an individual with duplicated KIR3DL3
on one haplotype was observed in a study of
>3,500 subjects.
12
Such individuals would then show an unusual copy number for
all KIR genes and hence would be identified for manual inspec-
tion. The individual identified with a duplicated KIR3DL3 had a
normal genotype of two KIR3DL2 copies,
12
so in this case KIR3DL2
could be used as a substitute standard.
KIR-Allele-Genotyping Module: PING_allele
This module is also divided into two complementary components,
the first (KFFallele) is based on string searches of raw data, and the
second (SOS) is based on read depth following alignment to refer-
ence sequences. A summary of the KFFallele workflow is shown in
Figure S1
A, and SOS is shown in
Figures S1
B and S1C.
High-Resolution KIR Genotyping with Virtual Probes: KFFallele. For
each KIR gene, an alignment of coding-region alleles was gener-
ated. From these, every possible unique 32-mer sequence that
did not overlap an exon boundary was generated, and the result-
ing pool was screened for the removal of all 32-mer sequences pre-
sent in another KIR gene. A ‘‘hit table’’ was generated with the
original allele alignment and bespoke R scripting (allgenos_hit_
table_2.R). The hit table was passed on to a random-forest algo-
rithm in the ‘‘RandomForest’’ package of the R statistical soft-
ware.
55
This algorithm ranks the probes according to the amount
of information they contribute to the final answer. The purpose of
the random-forest step is to obtain the most efficient subset of
probes, thereby increasing the computational speed of genotype
assignment. The output consists of a series of probe subsets that
have an increasing proportion of the total set (5%, 10%, 15%,
etc.). Empirical testing found that the most efficient subset is usu-
ally the 40% most informative probes. We applied the final probe
IHWG cell number
1
2
3
Gene
copy
number
Relative
read
count
0.1
0.5
0.9
1.3
20
40
60
80
100
KIR2DL5
0.1
0.5
0.9
1.3
20
40
60
80
100
KIR2DL4
IHWG cell number
4
0
1
2
3
Figure 2.
KIR Gene Copy-Number Geno-
type Determined by Read Depth
The ratio of reads mapping to a specific KIR
gene to those mapping to KIR3DL3 can be
used for calculating KIR copy number. The
results from 97 samples are shown and
sorted by ratio. KIR2DL4 (left) was present
in one, two, or three copies per individual
in the sample set, and KIR2DL5 (right)
was present in zero, one, two, three, or
four copies.
The American Journal of Human Genetics 99, 375–391, August 4, 2016
379