Genome-wide association studies (GWAS)
have seen an increase in popularity and success due to continued advancements
in recent years. Such studies aim to provide a way to identify common genetic
variants that could account for the genetic risk components of common human
diseases. There are two common GWAS designs: a case-control study and a
population-based study. The study discussed here was based on the case-control
design. The disease in question for this study was chronic kidney disease (CKD), a
long-term condition that affects millions worldwide. The aim of this GWAS
was to identify SNPs that may be associated with CKD. The identification of such genetic variants may
help to reveal the biological processes underlying CKD and could aid in improving
risk estimates and detection for the disease. Following quality control
and filtering, the study identified two SNPs in the discovery dataset that
also appeared in the replication dataset. This could indicate a potential link
between these SNPs and the occurrence of CKD, but further investigation would
need to be carried out to verify these results. This review also aims to describe the methods as well as some of
the strengths and limitations of GWAS for carrying out such studies.
Kidneys are responsible for the
life-sustaining functions of filtration, reabsorption, secretion and excretion within
the human body. The kidneys filter
and reabsorb over 220 litres of fluid to the bloodstream every 24 hours. Without
them, toxic levels of waste products and excess water begin to build up in the body.
CKD is a long-term condition and usually refers to any renal condition whereby
the kidneys begin to gradually lose function. It is estimated that around 2.6
million people aged 16 and older suffer from CKD in England alone (1), and prevalence is increasing
worldwide. It therefore presents a significant public health problem. In its most severe form,
end-stage renal disease (ESRD), dialysis is required. In 2009, it was estimated
that ESRD affected over 500,000 adults in the USA with expenditure reaching
$42.5 billion at that time (2). Typically, CKD is discovered by accident
when patients seek medical attention regarding something else or during routine
check-ups. If left undiscovered, by the time any outward symptoms appear it is
often too late for preventative measures and life-saving surgery may be the
only option. While anyone can suffer from CKD, it has
been found to be more prevalent among people of black and south Asian descent (3). CKD has many causes; however, the two biggest
and most well-known causes are hypertensive nephropathy and diabetic
nephropathy (DN). DN affects up to 40% of diabetic patients and is the leading
cause of ESRD (4). While ethnic background and
lifestyle have been established as risk factors for CKD, a portion of
the risk of CKD remains unexplained. This may suggest a genetic
contribution to CKD. Diabetic siblings of patients with diabetes-related ESRD were found to
be five times more at risk of developing ESRD compared with those without a family
history of the disease (5). The genetic
component of CKD has been shown in previous familial aggregation studies that
looked at families with a history of diabetes and hypertension. The heritability for glomerular filtration rate
(GFR) was estimated to range from 36 to 75% and from 16 to 49% for albuminuria (6)(7). Given the many potential genetic risk
factors for common diseases such as CKD, a genome-wide association study is an
excellent screening tool to discover genetic risk.
The study discussed here was based on the case-control design; such studies typically compare the frequency of alleles or genotypes at
single-nucleotide polymorphisms (SNPs) (8) in order to determine if there is an
association between SNPs and disease phenotypes. The allele frequency of each SNP is compared between individuals
with a disease (known as the cases), and individuals without (known as the controls).
A precise definition of cases and controls is crucial, as case-control studies tend
to be prone to selection bias. This occurs when controls are not representative
of the population from which the cases were drawn, and it is a common limitation of GWAS.
The wider aim of such studies is to
identify sets of loci that may be linked to common complex diseases; these loci
require further analysis after GWAS. GWAS therefore act as an important
preliminary step in the gene identification process (9).
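The case-control comparison of allele frequencies described above can be illustrated with a minimal sketch. It assumes a simple allelic test on a 2x2 table of allele counts; the function name and the counts are hypothetical and are not taken from the study data.

```python
import math


def allelic_chi_square(case_a1, case_a2, control_a1, control_a2):
    """Pearson chi-square test (1 degree of freedom) on a 2x2 allele-count table.

    Rows are cases and controls; columns are counts of allele A1 and allele A2.
    All four counts are assumed to be non-zero.
    """
    table = [[case_a1, case_a2], [control_a1, control_a2]]
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    total = sum(row_totals)

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected

    # With 1 degree of freedom, P(X > chi2) equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    # Odds of carrying A1 in cases relative to controls.
    odds_ratio = (case_a1 * control_a2) / (case_a2 * control_a1)
    return chi2, p_value, odds_ratio


# Hypothetical allele counts for one SNP (illustration only).
chi2, p_value, odds_ratio = allelic_chi_square(60, 40, 40, 60)
print(f"chi2={chi2:.3f}, p={p_value:.4g}, OR={odds_ratio:.2f}")
```

In a real GWAS this test is repeated for every SNP, which is why a stringent significance threshold is needed.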
· Computer workstation with Windows
software (10) for
genome-wide association analysis:
for data analysis and graphing:
Genotype quality control (QC) and filtering were
conducted at both the individual level and the SNP level. Filtering and QC are important
for removing false positive associations
within the datasets. Several QC steps must be carried out to
remove individuals or markers with particularly high error rates. Filtering and QC were carried out using gPLINK.
gPLINK is a Java-based program that provides a simplified interface for running common PLINK (10) commands. gPLINK
also allows results to be integrated into Haploview (11), which was then used to produce the
Manhattan plots for this report.
The discovery dataset contained 39637 variants and 478 individuals,
233 of which were males and 245 of which were females. Per-individual QC of GWA data consists
of at least four steps, which involve the identification of individuals:
1. With discordant sex information
2. With outlying missing genotype or heterozygosity rates
3. Who are duplicated or related
4. Of divergent ancestry (12)
The first step was to convert the MAP files into BED files and then check
the discovery dataset for potential sample identity problems. Each of these
steps was carried out in gPLINK following standard procedure for such GWAS.
The standard protocol and reasoning behind each step can be found in the
literature (12)(13). After
examining other GWAS, it was decided to
exclude all individuals with a genotype failure rate ≥ 0.03 and/or a
heterozygosity rate more than ± 3 standard deviations from the mean (12). To reduce computational
complexity, the SNPs used to create the identity-by-state (IBS) matrix
in the next step were taken from a pre-pruned dataset. Duplicated or related
individuals were filtered using an identity-by-descent (IBD) threshold of > 0.185; this figure was chosen because it
is standard in the literature, as it is considered to be halfway between the
expected IBD for third- and second-degree relatives (14). The five nearest neighbours were identified for each
individual based upon the pairwise IBS distance. IBS distance to each of the
five nearest neighbours was then transformed into a Z score. Individuals with a minimum Z score among the five nearest neighbours less than
-4 were excluded from analysis as population outliers (15).
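The missingness and heterozygosity exclusion rule described above can be sketched as follows. This is an illustration only, assuming that per-individual missing genotype and heterozygosity rates have already been computed; the function name, sample IDs and values are hypothetical and do not come from the study data.

```python
import statistics


def flag_individuals(samples, max_missing=0.03, n_sd=3.0):
    """Return the IDs of individuals failing per-individual QC.

    samples maps an individual ID to (missing_rate, heterozygosity_rate).
    An individual is flagged if the missing genotype rate is >= max_missing
    or the heterozygosity rate lies more than n_sd standard deviations
    from the mean across all individuals.
    """
    het_rates = [het for _, het in samples.values()]
    mean_het = statistics.mean(het_rates)
    sd_het = statistics.stdev(het_rates)

    flagged = set()
    for sample_id, (missing, het) in samples.items():
        if missing >= max_missing or abs(het - mean_het) > n_sd * sd_het:
            flagged.add(sample_id)
    return flagged


# Hypothetical cohort of 20 individuals (illustration only).
samples = {
    f"ind{i:02d}": (0.01, 0.30 if i % 2 == 0 else 0.32) for i in range(20)
}
samples["ind05"] = (0.04, 0.32)  # too many missing genotypes
samples["ind19"] = (0.01, 0.50)  # outlying heterozygosity

print(sorted(flag_individuals(samples)))
```

The thresholds are exposed as parameters because, as the text notes, there are no universally accepted cut-offs.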
Per-marker QC of GWA data consists of at least four steps, which involve the identification of SNPs:
1. With an excessive missing genotype rate
2. Demonstrating a significant deviation from Hardy-Weinberg equilibrium
3. With significantly different missing genotype rates between cases and controls
4. With a very low minor allele frequency (12)
It should be noted that there are no universally accepted thresholds for the exclusion
criteria in QC, but all values used below were chosen based on other similar
GWAS literature (16)(17).
Variants were excluded if they did not meet the following thresholds:
Minor allele frequency: 0.01
SNP missingness: 0.05
Individual missingness: 0.03
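As a rough illustration of how the minor allele frequency and SNP missingness thresholds above act on a single marker, consider the sketch below. It assumes genotypes coded as 0, 1 or 2 copies of one allele, with None for a missing call; the function name and example data are hypothetical, and this is not the PLINK implementation.

```python
def marker_passes_qc(genotypes, min_maf=0.01, max_missing=0.05):
    """Check one SNP against missingness and minor allele frequency thresholds.

    genotypes is a list with one entry per individual: 0, 1 or 2 copies of
    a chosen allele, or None where the genotype call is missing.
    """
    if not genotypes:
        return False
    called = [g for g in genotypes if g is not None]
    if not called:
        return False

    missing_rate = 1 - len(called) / len(genotypes)
    if missing_rate > max_missing:
        return False

    # Each called individual contributes two alleles.
    allele_freq = sum(called) / (2 * len(called))
    maf = min(allele_freq, 1 - allele_freq)
    return maf >= min_maf


# Hypothetical SNPs typed in 100 individuals (illustration only).
snp_common = [0] * 90 + [1] * 10               # MAF 0.05, no missing calls
snp_rare = [0] * 99 + [1]                      # MAF 0.005
snp_missing = [0] * 80 + [1] * 10 + [None] * 10  # 10% missing calls

print(marker_passes_qc(snp_common), marker_passes_qc(snp_rare), marker_passes_qc(snp_missing))
```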
This produced the clean GWA dataset, which was then used in the
association analysis. A conventional χ² test
for association was carried out, details of which can be found in the
literature (8). The
following criteria were selected:
· required observation per cell: 5
· intervals: 0.95
· options: max(T) permutation mode: 10000
The odds ratio (OR) was then
calculated according to a model of logistic regression without considering
covariates. The PLINK (10) option
"--allow-no-sex" was used for each
step in the association as well as the inclusion of the alternate phenotype
The values in the P column of the data produced were then filtered by p < 10⁻⁵ to identify the statistically significant associations. While the current standard for genome-wide significance is p ≤ 5 × 10⁻⁸ (18), some argue that this threshold is too conservative (19) and that a relaxed threshold may be appropriate for some studies (20). The National Human Genome Research Institute has used the cut-off value of p < 10⁻⁵ in over 700 GWA studies (21), and this value was chosen for this study also.

All of the QC, filtering and association analysis steps were repeated on a replication dataset. SNPs for replication were not ideally selected, but rather were genotypes available from another genotyping laboratory for a diabetic kidney disease (DKD) cohort and were a 'best-case-scenario' available to follow up results of the discovery GWAS. There were no clinical covariates available for the replication dataset. The replication dataset contained 7 variants and 96 individuals of unspecified sex. The PLINK (10) option "--allow-no-sex" was used for each step in the replication, as the sex of the individuals was ambiguous in this case.

Results

The discovery dataset contained 39637 variants and 478 individuals, 233 of which were males and 245 of which were females. After genotype QC and filtering, the clean GWA data contained 267868 variants and 465 individuals (229 cases and 235 controls). Following the logistic regression step, the values in the P column of the data produced were filtered by a value of p < 10⁻⁵ to identify any statistically significant associations. This produced two SNPs, which can be seen in Table 1 below.

CHR  SNP         BP         A1  TEST  NMISS  OR     STAT   P
13   rs1591173   24050943   C   ADD   464    1.975  4.447  8.69E-06
13   rs4522294   93545188   G   ADD   464    3.032  4.565  5.00E-06

Table 1: SNPs remaining in the discovery dataset following QC and filtering

The replication dataset was much smaller and contained only 7 variants and 96 individuals (no specified sex), 61 of which were cases and 35 controls.
When QC was carried out on the replication cohort, the clean replication GWA data contained 7 variants and 92 individuals. The 7 SNPs can be seen in Table 2 below.

CHR  SNP          BP          A1  TEST  NMISS  OR      STAT    P
1    rs12124937   168825618   G   ADD   92     0.2048  -4.297  1.73E-05
2    rs9287656    15428438    A   ADD   92     0.2188  -3.474  0.0005121
2    rs10173491   15451774    C   ADD   92     0.2188  -3.474  0.0005121
11   rs3740769    77227076    C   ADD   92     0.1462  -4.247  2.17E-05
13   rs1591173    23476804    A   ADD   92     0.2805  -3.85   0.0001181
13   rs4522294    92892935    A   ADD   92     0.2805  -3.85   0.0001181
14   rs8008661    41779819    A   ADD   92     0.2048  -4.297  1.73E-05

Table 2: SNPs remaining after QC on the replication dataset

Haploview (11) was used to produce a Manhattan plot of all SNPs from the discovery dataset following the logistic regression phase. This can be seen in Fig 1 accompanying this report. Fig 2 shows the remaining 2 SNPs after the discovery data was filtered by p < 10⁻⁵.

Discussion

The results from the discovery dataset appear to show that 2 SNPs out of 39637 were statistically significant following QC and association. The same 2 SNPs were also present in the replication cohort: rs1591173 and rs4522294, both located on chromosome 13. This would seem to suggest that the results were replicated. Replication is essential for establishing the credibility of a genotype–phenotype association. However, there is still ongoing debate on what constitutes an adequate replication study (22).

Despite this result, the dataset provided for the replication phase of the study was far too small for meaningful QC. Running the QC on the replication dataset had essentially no impact on the final set of SNPs. There were not enough SNPs present in the replication data to identify any relationship between samples. As stated earlier, the replication dataset was a highly selected group of patients with DKD; genotype distributions would therefore not be expected to be within Hardy-Weinberg equilibrium (HWE).
One solution to replicating results could be to require that replication studies use the same phenotype and definition of phenotype as the discovery cohort, to help avoid false positives (23). This was not the case in this study: the individuals in the discovery cohort were said to have had CKD, with no mention of any underlying illness that may have caused it, whereas individuals in the replication cohort specifically had DKD. Thus, it cannot be said that the results show a credible association rather than a chance finding.

Small sample size is a frequent problem in such GWAS and usually results in insufficient power to detect minor contributors of one or more alleles. Likewise, small sample sizes can provide imprecise or incorrect estimates of the magnitude of the observed effects. A lack of comparability between cases and controls can increase the risk of bias because there can be heterogeneity in exposure to environmental challenges and population stratification. The latter arises when investigators fail to account for case-control differences in the genetic structure of the underlying population. Similarly, 'data dredging' is a significant problem in such GWAS (23). Because there are no defined criteria for the thresholds used during QC, data can be altered in order to achieve results that appear to be statistically significant and worthy of publication.

Another issue with the replication data in this study was that the sex of the individuals was not specified. The "--allow-no-sex" option had to be used for the association and replication. This was necessary because, when sex is not present, PLINK (10) sets the phenotypes of ambiguous-sex individuals to missing, and the process would not have generated any association results.
In the field of GWAS, the importance of QC has been well appreciated for some time, as even small sources of systematic or random error can result in false associations or obscure real ones. Allowing the sex to remain ambiguous would therefore be likely to have an impact on results. In fact, some have suggested that separate studies should be carried out for males and females altogether. There is mounting evidence of the importance of sex-differentiated effects in complex traits (24). Consider rs17810398 within the DAPL1 gene. Previous GWAS did not discover any association between this SNP and age-related macular degeneration; however, when analyses were stratified by sex, the association was highly significant for females (p = 2.6 × 10⁻⁸) and not males (p = 0.382) (25). This shows that significant associations can be lost when combining male and female data into one dataset. It has also been reported that sex-differentiated or sex-specific effects may be a contributor to the "missing heritability" of complex traits in such GWAS (24).