Abstract: Predicting the number of involved lymph nodes in breast cancer patients is important
for assessing the severity and prognosis of the disease. The distribution of involved nodes
contains a large number of negative nodes (zeroes), which leads to over-dispersion. Until now,
the negative binomial model has been used to describe this distribution, assuming that
over-dispersion is due only to unobserved heterogeneity. In such situations, alternative count
regression models, such as Poisson regression models, may also be used to account for
over-dispersion. The primary objective of this paper is to estimate the number of involved lymph
nodes in breast cancer patients using Poisson and negative binomial regression in a Bayesian
setup, by introducing a prior distribution on the weights of the linear function. Since exact
inference is analytically unobtainable, we derive a closed-form approximation to the posterior
distribution of the model.
Keywords: Lymph node, Bayesian Poisson, Negative binomial, Over-dispersion
INTRODUCTION
Carcinoma of the breast is the most commonly diagnosed malignancy among women and the most
common cause of death from carcinoma in females, and it has become a major health threat
globally. Breast cancer has many important prognostic factors, such as Estrogen Receptor
(ER)/Progesterone Receptor (PR) status, tumor grade, tumor size, HN2 status and the number of
involved lymph nodes. Among these, lymph node involvement is the most important prognostic
factor. The number of involved lymph nodes gives an idea of the severity and prognosis of the
disease. Its accurate prediction helps in grading the severity of disease, so that extensive
axillary dissection can be avoided (Hernandez et al., 2006; Slymen et al., 2006). Although it
is an important prognostic factor, it is not necessarily associated with the stage of cancer:
patients with the same number of involved lymph nodes may be at different stages, and patients
with more involved lymph nodes are not necessarily at a more advanced stage (Yafume et al.,
1993). It is also highly variable within populations because of heterogeneity in population
demographics as well as the interaction between various random and non-random biophysical
factors within individuals (Weiss, 1983; Kendall, 2005). The involvement of axillary lymph
nodes is the most significant prognostic factor for patients with early-stage breast cancer,
and it is directly related to the risk of distant recurrence (Cianfrocca, 2015). Many studies
have examined the status of lymph nodes (present or absent) in breast cancer patients
(Cianfrocca and Goldstein, 2004), but only a few have estimated the number of involved lymph
nodes. Until now, the negative binomial model has been extensively used to describe this
distribution, assuming that over-dispersion is due only to unobserved heterogeneity, although
alternative models such as Poisson regression models have also been considered for
over-dispersed counts. A negative binomial model has been found to describe the number of
involved nodes better than the Poisson model because of excess variability (over-dispersion)
(Hung et al., 2008). The over-dispersion is mainly due to excess zeroes, since the distribution
of involved nodes contains a large number of negative nodes (zeroes). The number of negative
lymph nodes (NLNs) is itself an independent prognostic factor for disease-free survival (DFS)
in breast cancer patients after mastectomy, and patients with a higher number of NLNs have a
better DFS (Wu et al., 2015). Consequently, the Poisson or negative binomial distribution may
not provide satisfactory results under over-dispersion, and in such situations zero-hurdle and
zero-inflated regression models can be used to improve prediction. One study that estimated
the number of involved lymph nodes concluded that zero-hurdle negative binomial and
zero-inflated negative binomial models describe the distribution of lymph nodes more
appropriately than the negative binomial model in the presence of over-dispersion (Dwivedi et
al., 2010). In this study we estimate the number of involved lymph nodes using Poisson and
negative binomial regression under a Bayesian approach. We chose the Bayesian approach because
it makes it easier to estimate and analyze complicated problems for which standard methods are
cumbersome. To the best of our knowledge, no such study has been carried out. We analyze the
standard Poisson and negative binomial regression models in a Bayesian setting by placing a
normal prior on the regression coefficients. The normal prior was chosen because both the
Poisson likelihood and the normal distribution belong to the exponential family, and when a
family of conjugate priors exists, choosing a prior from that family simplifies calculation of
the posterior distribution.
Although this paper is concerned with less sophisticated analyses in which the driving force is
the desire for the Bayesian framework, it’s important to note that the consummate value of the
Bayesian method might be to provide statistical inference for problems that couldn’t be handled
without it.
Material and Methods:
Data: The study population includes all female primary breast cancer patients treated at the
breast clinic (Department of General Surgery, IPGMER, SSKM Hospital, Kolkata) from Jan 2009 to
Dec 2010 who had their pre-operative serum CA15-3 measured; CA15-3 was then recorded on
post-operative days 7 and 30 and every 6 months for 2 years. Patients were excluded if any
other malignancy was known from their previous history or if staging investigations at the
time of diagnosis revealed evidence of distant metastasis. A total of 85 patients fulfilled the
criteria for this analysis. Patients were treated with either modified radical mastectomy (MRM)
or quadrantectomy and axillary lymph node dissection with local radiotherapy (RT). After
surgery, RT and appropriate adjuvant chemotherapy or hormone therapy were administered as
indicated by international guidelines and were not altered according to marker levels.
Poisson regression: Poisson regression analysis derives its name from the Poisson distribution,
which is a mathematical distribution often used to describe the probability of occurrence of
count data. Let $Y_i$ denote the number of nodes for the $i$-th breast cancer patient. Since
these data are in terms of counts, we assume that $Y_i$ follows a Poisson distribution with
mean $\mu_i$ (mean number of involved nodes). Hence, the probability of observing any specific
count $y_i$ is given by the following formula:
$$P(Y_i = y_i) = \frac{e^{-\mu_i}\,\mu_i^{y_i}}{y_i!}, \quad y_i = 0, 1, 2, \ldots;\ \mu_i > 0$$
We postulate that the mean value $\mu_i$ depends on a set of predictors $x_1, x_2, \ldots, x_p$
such that
$$\log(\mu_i) = \beta_0 + \beta_1 x_1 + \ldots + \beta_p x_p$$
Or, $\mu_i = \exp(\beta_0 + \beta_1 x_1 + \ldots + \beta_p x_p)$, i.e. $\mu_i = e^{X'\beta}$.
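The probability formula and the log link above can be sketched in a few lines of Python. This is a minimal illustration only: the function names, the coefficient values and the covariate values are hypothetical and are not taken from the study.

```python
import math

def poisson_pmf(y, mu):
    """P(Y = y) for a Poisson count with mean mu > 0."""
    return math.exp(-mu) * mu ** y / math.factorial(y)

def poisson_mean(beta0, betas, xs):
    """Log link: mu = exp(beta0 + beta_1 * x_1 + ... + beta_p * x_p)."""
    return math.exp(beta0 + sum(b * x for b, x in zip(betas, xs)))

# Hypothetical coefficients and covariate values, for illustration only.
mu = poisson_mean(0.2, [0.8, -0.1], [1.0, 2.0])

# Probabilities of observing 0, 1, ..., 59 nodes; they should sum to about 1.
probs = [poisson_pmf(y, mu) for y in range(60)]
print(round(mu, 4), round(sum(probs), 6))
```

Because the linear predictor is exponentiated, the mean stays positive for any values of the coefficients, which is why the log link is the usual choice for count data.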
Bayesian Poisson regression
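We have,
$$f(y_i \mid \mu_i) = \frac{e^{-\mu_i}\,\mu_i^{y_i}}{y_i!}, \quad y_i = 0, 1, 2, \ldots$$
Let $\mu_i = \exp\left(\sum_{j=1}^{p} x_{ij}\beta_j\right)$, where $\sum_{j=1}^{p} x_{ij}\beta_j$ is the
linear combination of covariates, $x_{ij}$ ($i = 1, 2, \ldots, n$; $j = 1, 2, \ldots, p$) are
the covariates and the $\beta_j$'s are the regression coefficients, then
$$f(y_i) = \frac{\exp\left[-\exp\left(\sum_{j=1}^{p} x_{ij}\beta_j\right)\right]\left[\exp\left(\sum_{j=1}^{p} x_{ij}\beta_j\right)\right]^{y_i}}{y_i!}$$
$$= \frac{\exp\left[-\exp\left(\sum_{j=1}^{p} x_{ij}\beta_j\right) + y_i\sum_{j=1}^{p} x_{ij}\beta_j\right]}{y_i!}$$
$$l(y) = \prod_{i=1}^{n} f(y_i) = \frac{\exp\left[-\sum_{i=1}^{n}\exp\left(\sum_{j=1}^{p} x_{ij}\beta_j\right) + \sum_{i=1}^{n} y_i\sum_{j=1}^{p} x_{ij}\beta_j\right]}{\prod_{i=1}^{n} y_i!}$$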
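The logarithm of the likelihood $l(y)$ above is what is evaluated in practice. The following sketch computes it for a small design matrix; the data and the two sets of coefficients are hypothetical toy values chosen only to show the calculation.

```python
import math

def poisson_log_likelihood(beta, X, y):
    """log l(y) = sum_i [ y_i * eta_i - exp(eta_i) - log(y_i!) ],
    where eta_i = sum_j x_ij * beta_j is the linear predictor."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        eta = sum(x * b for x, b in zip(x_i, beta))
        # lgamma(y + 1) equals log(y!)
        total += y_i * eta - math.exp(eta) - math.lgamma(y_i + 1)
    return total

# Toy data (hypothetical): a constant column and one binary covariate.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 1.0], [1.0, 0.0]]
y = [0, 3, 5, 1]

ll_null = poisson_log_likelihood([0.0, 0.0], X, y)
ll_alt = poisson_log_likelihood([-0.7, 2.1], X, y)
print(ll_null, ll_alt)
```

Coefficients that track the group means (the second call) give a higher log-likelihood than coefficients fixed at zero, which is the quantity the posterior then combines with the prior.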
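Let us assume the prior for β as $\beta_j \sim N(a_j, b_j^2)$ for $j = 1, 2, \ldots, p$.
So the joint density of β's can be written as:
$$p(\beta_1, \beta_2, \ldots, \beta_p) = \prod_{j=1}^{p}\frac{1}{\sqrt{2\pi}\, b_j}\exp\left[-\frac{(\beta_j - a_j)^2}{2 b_j^2}\right]; \quad b_j > 0$$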
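Therefore the posterior for β's can be obtained as:
$$P(\beta \mid y) \propto l(y)\, p(\beta)$$
$$= \frac{\exp\left[-\sum_{i=1}^{n}\exp\left(\sum_{j=1}^{p} x_{ij}\beta_j\right) + \sum_{i=1}^{n} y_i\sum_{j=1}^{p} x_{ij}\beta_j\right]}{\prod_{i=1}^{n} y_i!} \cdot \prod_{j=1}^{p}\frac{1}{\sqrt{2\pi}\, b_j}\exp\left[-\frac{(\beta_j - a_j)^2}{2 b_j^2}\right]$$
$$= \frac{\exp\left[-\sum_{i=1}^{n}\exp\left(\sum_{j=1}^{p} x_{ij}\beta_j\right) + \sum_{j=1}^{p}\beta_j\sum_{i=1}^{n} x_{ij} y_i - \sum_{j=1}^{p}\frac{(\beta_j - a_j)^2}{2 b_j^2}\right]}{\prod_{i=1}^{n} y_i!\ \prod_{j=1}^{p}\sqrt{2\pi}\, b_j}$$
$$= \frac{\exp\left[-\sum_{i=1}^{n}\exp\left(\sum_{j=1}^{p} x_{ij}\beta_j\right) - \frac{1}{2}\sum_{j=1}^{p}\left[\frac{\beta_j^2}{b_j^2} - 2\beta_j\left(\sum_{i=1}^{n} x_{ij} y_i + \frac{a_j}{b_j^2}\right) + \frac{a_j^2}{b_j^2}\right]\right]}{\prod_{i=1}^{n} y_i!\ \prod_{j=1}^{p}\sqrt{2\pi}\, b_j}$$
Let $d_j = \sum_{i=1}^{n} x_{ij} y_i + \frac{a_j}{b_j^2}$ for $j = 1, 2, \ldots, p$.
So,
$$P(\beta_j \mid y) \propto \exp\left[-\sum_{i=1}^{n}\exp\left(\sum_{j=1}^{p} x_{ij}\beta_j\right) - \frac{1}{2}\left(\frac{\beta_j^2}{b_j^2} - 2\beta_j d_j\right)\right] \quad (1)$$
which on simplification will yield a normal distribution with mean $d_j$ and variance $b_j^2$.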
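A normal approximation to a posterior of this kind can also be obtained numerically. The sketch below is not the derivation given above and is not the INLA algorithm; it is a simple Laplace-type illustration under stated assumptions: a single coefficient, a normal prior with mean `a` and standard deviation `b`, Newton's method to locate the posterior mode, and the curvature at the mode as the approximate variance. The data and prior values are hypothetical.

```python
import math

def log_posterior(beta, x, y, a, b):
    """Unnormalised log posterior for one coefficient with a N(a, b^2) prior.
    The log(y_i!) term is omitted because it does not depend on beta."""
    loglik = sum(y_i * x_i * beta - math.exp(x_i * beta) for x_i, y_i in zip(x, y))
    return loglik - (beta - a) ** 2 / (2.0 * b * b)

def curvature(beta, x, b):
    """Second derivative of the log posterior with respect to beta."""
    return -sum(x_i * x_i * math.exp(x_i * beta) for x_i in x) - 1.0 / (b * b)

def normal_approximation(x, y, a, b, steps=50):
    """Newton iterations for the posterior mode; variance from the curvature."""
    beta = 0.0
    for _ in range(steps):
        grad = sum(x_i * (y_i - math.exp(x_i * beta)) for x_i, y_i in zip(x, y))
        grad -= (beta - a) / (b * b)
        step = grad / curvature(beta, x, b)
        beta -= step
        if abs(step) < 1e-12:
            break
    return beta, -1.0 / curvature(beta, x, b)

# Toy data (hypothetical): intercept-only model with a vague prior.
x = [1.0, 1.0, 1.0, 1.0]
y = [0, 3, 5, 1]
mode, variance = normal_approximation(x, y, a=0.0, b=10.0)
print(mode, variance)
```

With a vague prior the mode sits close to the logarithm of the sample mean, and the approximate variance shrinks as more counts are observed; a tighter prior pulls the mode towards `a`.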
Bayesian Negative Binomial Regression:
Let $Y_i$ denote the number of nodes for the $i$-th breast cancer patient. Since these data are
in terms of counts, we assume that $Y_i$ follows a Negative Binomial distribution with mean
$\mu_i$ (mean number of involved nodes), where $p$ is the probability of having a positive node
and $r$ is the number of negative nodes preceding the first positive node. Let
$Y_i \sim NBD(r, p)$, so the probability of observing any specific count $y_i$ is given by the
following formula:
$$P(Y_i = y_i) = \frac{\Gamma(y_i + r)}{\Gamma(r)\, y_i!}\, p^{r} (1 - p)^{y_i}, \quad y_i = 0, 1, 2, 3, \ldots;\ 0 < p < 1$$
We know
$$E(y_i) = \frac{r(1 - p)}{p} = \mu_i \quad \& \quad V(y_i) = \frac{r(1 - p)}{p^2} = \mu_i + \frac{\mu_i^2}{r}$$
So
$$P(y_i) = \frac{\Gamma(y_i + r)}{\Gamma(r)\, y_i!}\left[\frac{r}{r + \mu_i}\right]^{r}\left[\frac{\mu_i}{r + \mu_i}\right]^{y_i}$$
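The mean and variance relations for the negative binomial distribution can be checked numerically. The sketch below evaluates the probability function in its mean parametrisation and sums over a long range of counts; the values of `mu` and `r` are hypothetical and serve only to illustrate that the variance exceeds the mean by `mu**2 / r`.

```python
import math

def nb_log_pmf(y, mu, r):
    """Log of P(Y = y) for a negative binomial with mean mu and size r,
    using p = r / (r + mu)."""
    return (math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
            + r * math.log(r / (r + mu))
            + y * math.log(mu / (r + mu)))

mu, r = 4.4, 0.8  # hypothetical values, for illustration only

# The tail beyond 2000 is negligible for these values.
probs = [math.exp(nb_log_pmf(y, mu, r)) for y in range(2000)]
mean = sum(y * p for y, p in enumerate(probs))
var = sum((y - mean) ** 2 * p for y, p in enumerate(probs))

print(mean, var, mu + mu ** 2 / r)
```

A small `r` gives a variance far above the mean, which is how the negative binomial model accommodates over-dispersed counts; as `r` grows the distribution approaches the Poisson.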
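Let $\mu_i = \exp\left(\sum_{j=1}^{p} x_{ij}\beta_j\right)$, where $\sum_{j=1}^{p} x_{ij}\beta_j$ is the
linear combination of covariates, $x_{ij}$ ($i = 1, 2, \ldots, n$; $j = 1, 2, \ldots, p$) are
the covariates and the $\beta_j$'s are the regression coefficients.
So the likelihood of Y's can be written as:
$$L(Y) = \prod_{i=1}^{n} P(y_i) = \left[\prod_{i=1}^{n}\frac{\Gamma(y_i + r)}{\Gamma(r)\, y_i!}\right]\frac{r^{nr}\prod_{i=1}^{n}\mu_i^{y_i}}{\prod_{i=1}^{n}(r + \mu_i)^{r + y_i}}$$
Writing $C = \prod_{i=1}^{n}\frac{\Gamma(y_i + r)}{\Gamma(r)\, y_i!}$,
$$\log L(Y) = \log C + nr\log r + \sum_{i} y_i\log\mu_i - \sum_{i}(r + y_i)\log(r + \mu_i)$$
$$= C_1 + \sum_{i}\left[y_i\log\mu_i - (r + y_i)\log\left(1 + \frac{\mu_i}{r}\right)\right]$$
where $C_1 = \log C - \log r\sum_{i} y_i$ does not depend on the $\beta_j$'s.
Now as $\mu_i = \exp\left(\sum_{j=1}^{p} x_{ij}\beta_j\right)$, so
$$\log L(Y) = C_1 + \sum_{i}\left[y_i\sum_{j} x_{ij}\beta_j - (r + y_i)\log\left(1 + \frac{1}{r}e^{\sum_{j} x_{ij}\beta_j}\right)\right]$$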
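Let us assume the prior for β as $\beta_j \sim N(a_j, b_j^2)$ for $j = 1, 2, \ldots, p$.
So the joint density of β's can be written as:
$$p(\beta_1, \beta_2, \ldots, \beta_p) = \prod_{j=1}^{p}\frac{1}{\sqrt{2\pi}\, b_j}\exp\left[-\frac{(\beta_j - a_j)^2}{2 b_j^2}\right]; \quad b_j > 0$$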
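Therefore the posterior for β's can be obtained, on the log scale, as:
$$\log P(\beta \mid y) = \log L(Y) + \log p(\beta) + \text{constant}$$
$$= C_1 + \sum_{i}\left[y_i\log\mu_i - (r + y_i)\log\left(1 + \frac{\mu_i}{r}\right)\right] + \sum_{j=1}^{p}\left[-\log\left(\sqrt{2\pi}\, b_j\right) - \frac{(\beta_j - a_j)^2}{2 b_j^2}\right] + \text{constant}$$
$$= C_2 + \sum_{i}\left[y_i\sum_{j} x_{ij}\beta_j - (r + y_i)\log\left(1 + \frac{1}{r}e^{\sum_{j} x_{ij}\beta_j}\right)\right] - \sum_{j=1}^{p}\frac{1}{2 b_j^2}\left(\beta_j^2 - 2 a_j\beta_j\right) \quad (2)$$
where $C_1$ is the constant of the log-likelihood and $C_2$ collects all terms that do not
depend on the $\beta_j$'s.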
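The part of the negative binomial log posterior that depends on the coefficients can be written as a short function. The sketch below assumes a known size parameter `r` and independent normal priors with means `a` and standard deviations `b`; the second function evaluates the full log-likelihood plus log prior density and is included only to check that the two differ by a constant. All data and parameter values are hypothetical.

```python
import math

def nb_log_posterior_kernel(beta, X, y, r, a, b):
    """Beta-dependent part of the negative binomial log posterior."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        eta = sum(x * bj for x, bj in zip(x_i, beta))
        total += y_i * eta - (r + y_i) * math.log(1.0 + math.exp(eta) / r)
    for bj, aj, sj in zip(beta, a, b):
        total -= (bj * bj - 2.0 * aj * bj) / (2.0 * sj * sj)
    return total

def nb_full_log_posterior(beta, X, y, r, a, b):
    """Full log-likelihood plus log normal prior density (for checking)."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        mu = math.exp(sum(x * bj for x, bj in zip(x_i, beta)))
        total += (math.lgamma(y_i + r) - math.lgamma(r) - math.lgamma(y_i + 1)
                  + r * math.log(r / (r + mu)) + y_i * math.log(mu / (r + mu)))
    for bj, aj, sj in zip(beta, a, b):
        total += -0.5 * math.log(2.0 * math.pi * sj * sj) - (bj - aj) ** 2 / (2.0 * sj * sj)
    return total

# Toy data and prior settings (hypothetical).
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 1.0], [1.0, 0.0]]
y = [0, 3, 5, 1]
r, a, b = 1.5, [0.0, 0.0], [10.0, 10.0]

beta0, beta1 = [0.0, 0.0], [-0.5, 1.8]
kernel_diff = nb_log_posterior_kernel(beta1, X, y, r, a, b) - nb_log_posterior_kernel(beta0, X, y, r, a, b)
full_diff = nb_full_log_posterior(beta1, X, y, r, a, b) - nb_full_log_posterior(beta0, X, y, r, a, b)
print(kernel_diff, full_diff)
```

Since only differences in the log posterior matter for locating the mode or for sampling, the constant terms can be dropped without affecting inference on the coefficients.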
We first obtain the posterior estimates of the regression coefficients (β's), taking a normal
prior, under both the Poisson and negative binomial models. We then obtain the posterior
summaries of the regression coefficients under each distribution separately. The posteriors
obtained from the Poisson and negative binomial regressions are shown in equations (1) and (2).
The posterior summaries were obtained using the INLA package in R (Lopiano et al., 2013).
All analyses were performed using the R statistical software, version 3.2.2.
Deviance Information Criterion(DIC):
The deviance information criterion (DIC) (Spiegelhalter et al., 2002) is a model assessment
tool and a Bayesian alternative to Akaike’s information criterion (AIC) and the Bayesian
information criterion (BIC, also known as the Schwarz criterion). It is a Bayesian measure that
takes account of both the “goodness of fit” and the “complexity” of a fitted model, and it can
be used for comparing and ranking competing models. “Complexity” is measured by the “effective
number of parameters” ($p_D$), which is defined as the posterior mean deviance minus the
deviance measured at the posterior mean of the parameters.
Let $y = (y_1, y_2, \ldots, y_n)$ be the n-dimensional r.v. whose distribution depends on a
p-dimensional parameter vector $\theta$, then
$$p_D = E_{\theta \mid y}[D] - D\left(E_{\theta \mid y}[\theta]\right)$$
So the DIC can be defined as:
$$DIC = D\left(E_{\theta \mid y}[\theta]\right) + 2 p_D = E_{\theta \mid y}[D] + p_D$$
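The two equivalent forms of the DIC can be illustrated with a small calculation. The sketch below assumes that deviance values are available for a set of posterior draws, together with the deviance evaluated at the posterior mean of the parameters; the function name and the numbers are hypothetical.

```python
def dic_from_draws(deviance_draws, deviance_at_posterior_mean):
    """Return the effective number of parameters p_D and the DIC."""
    mean_deviance = sum(deviance_draws) / len(deviance_draws)
    p_d = mean_deviance - deviance_at_posterior_mean
    return {
        "mean_deviance": mean_deviance,
        "p_D": p_d,
        "DIC": deviance_at_posterior_mean + 2 * p_d,
    }

# Hypothetical deviance values for four posterior draws.
result = dic_from_draws([102.0, 98.0, 105.0, 99.0], 97.0)
print(result)
```

Both expressions give the same number: the deviance at the posterior mean plus twice $p_D$, or the posterior mean deviance plus $p_D$. Smaller values indicate a better trade-off between fit and complexity.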
Results: The study population includes 85 breast cancer patients who were diagnosed from Jan
2009 to Dec 2010. The mean age of patients at diagnosis was 50.09 years (SD=12.82), ranging
from 25 to 85 years. The descriptive characteristics of the various prognostic factors are
shown in Table 1. Involved nodes were found in 35 (41.2%) of the 85 patients. The mean and
standard deviation of the number of involved nodes per patient were 4.4 and 4.7 respectively
(median=2 and range 0-15).
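The reported mean and standard deviation already point to over-dispersion, since a Poisson model requires the variance to equal the mean. The short calculation below takes the two reported figures at face value; the moment-based size parameter uses the negative binomial variance relation $V = \mu + \mu^2 / r$ and is a rough illustration, not an estimate produced by the study.

```python
mean_nodes = 4.4  # reported mean number of involved nodes
sd_nodes = 4.7    # reported standard deviation

variance = sd_nodes ** 2
# Ratio of variance to mean; it equals 1 under a Poisson model.
dispersion_index = variance / mean_nodes
# Size parameter implied by V = mu + mu^2 / r (illustrative only).
r_moment = mean_nodes ** 2 / (variance - mean_nodes)

print(round(dispersion_index, 2), round(r_moment, 2))
```

The variance is roughly five times the mean, which is consistent with the later finding that the negative binomial model fits these counts better than the Poisson model.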
Table 1: Descriptive Characteristics of Breast Cancer Patients (N=85)
Factors        Categories (Code)    Frequency    Percentage
CA15-3         <25                  10           11.8
               ≥25                  75           87.2
T Size (cm)    <2 (0)               24           27.9
               2-5 (1)              48           55.8
               ≥5 (2)               13           15.1
Nodal M        0-3 (0)              50           58.8
               4-9 (1)              19           22.4
               ≥9 (2)               16           18.8
Tumor Grade    I (1)                23           27.1
               II (2)               42           49.4
               III (3)              20           23.5
ER Status      Negative (0)         40           47.1
               Positive (1)         45           52.9
PR Status      Negative (0)         48           56.5
               Positive (1)         37           43.5
HN2 Status     Negative (0)         53           62.4
               Positive (1)         32           37.6
Table 2 gives the summary statistics of the posterior marginals of the fixed effects, together
with the precision of the error term and the intrinsic random effect, obtained under the
Poisson and negative binomial distributions. The table shows that larger tumor size (2-5 cm) is
strongly associated with an increased risk of a higher number of involved lymph nodes. A
descriptive comparison reveals that tumor size, tumor grade (III) and CA15-3 (pre-operative
value) are consistently significant across both models. In addition, PR status is significant
in the Poisson regression model, and ER status and HN2 status are significant in the negative
binomial regression model. The standard error of the mean, also known as the Monte Carlo
standard error (MCSE), measures the accuracy of the posterior estimates. The marginal
likelihood, which is computed by integrating out all model parameters and is used for model
selection, is higher for the negative binomial distribution than for the Poisson distribution
(-258.48 versus -351.26), which suggests that the negative binomial distribution is better
than the Poisson distribution for predicting the number of involved lymph nodes in breast
cancer patients. The DIC for the negative binomial regression model (445.62) is also lower than
that for the Poisson regression model (613.04), which leads to the same conclusion: the
negative binomial distribution explains the distribution of the number of lymph nodes better
than the Poisson distribution.
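The size of the difference between the two models can be made explicit with simple arithmetic on the reported figures. The sketch below assumes that the reported marginal likelihoods are on the log scale, as is usual for INLA output; the variable names are illustrative.

```python
# Reported model comparison figures.
dic = {"poisson": 613.04, "negative_binomial": 445.62}
log_marginal = {"poisson": -351.26, "negative_binomial": -258.48}

# A positive difference favours the negative binomial model.
delta_dic = dic["poisson"] - dic["negative_binomial"]
# Difference of log marginal likelihoods (assumed to be on the log scale).
log_bayes_factor = log_marginal["negative_binomial"] - log_marginal["poisson"]
# Model with the smallest DIC.
preferred = min(dic, key=dic.get)

print(round(delta_dic, 2), round(log_bayes_factor, 2), preferred)
```

The DIC differs by about 167 and the log marginal likelihood by about 93, so both criteria point in the same direction by a wide margin.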
Table 2: Posterior Estimates obtained by using Poisson Regression and Negative Binomial
Regression Model
                       Poisson regression model             Negative Binomial regression model
Parameters             Mean     SD      95% HPD             Mean     SD      95% HPD
(Intercept)            0.216    0.309   (-0.403, 0.811)     -0.186   0.851   (-1.841, 1.510)
Age                    0.007    0.004   (-0.002, 0.016)     0.010    0.013   (-0.014, 0.036)
Tumor Size (in cm)
  2-5                  0.756    0.163   (0.444, 1.083)      0.938    0.426   (0.097, 1.775)
  ≥5                   1.181    0.190   (0.812, 1.558)      1.303    0.525   (0.293, 2.359)
Tumor Grade
  II                   -0.052   0.150   (-0.345, 0.244)     -0.049   0.415   (-0.871, 0.761)
  III                  0.165    0.154   (0.014, 0.469)      0.210    0.448   (0.071, 1.092)
ER Status              0.036    0.136   (-0.23, 0.304)      0.149    0.416   (-0.677, 0.959)
PR Status              0.137    0.125   (-0.110, 0.383)     -0.108   0.399   (-0.892, 0.678)
HN2 Status             -0.057   0.116   (-0.286, 0.169)     0.088    0.350   (-0.593, 0.786)
CA15 (Pre-op)          0.124    0.201   (-0.258, 0.533)     0.322    0.518   (-0.731, 1.310)
DIC                    613.04                               445.62
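Because both models use a log link, the coefficients in Table 2 are on the log scale, and exponentiating them gives multiplicative effects on the expected number of involved nodes. The sketch below does this for the tumor size rows, assuming that tumor size below 2 cm (code 0 in Table 1) is the reference category. Exponentiating a posterior mean is a convenient summary rather than the exact posterior mean of the ratio.

```python
import math

# (posterior mean, HPD lower, HPD upper) on the log scale, copied from Table 2.
tumor_size = {
    "Poisson, 2-5 cm": (0.756, 0.444, 1.083),
    "Poisson, >=5 cm": (1.181, 0.812, 1.558),
    "Negative binomial, 2-5 cm": (0.938, 0.097, 1.775),
    "Negative binomial, >=5 cm": (1.303, 0.293, 2.359),
}

# Exponentiate each value to obtain ratios of expected counts.
rate_ratios = {
    name: tuple(round(math.exp(v), 2) for v in values)
    for name, values in tumor_size.items()
}

for name, (ratio, low, high) in rate_ratios.items():
    print(f"{name}: rate ratio {ratio} ({low}, {high})")
```

Under the Poisson model, for example, a tumor of 2-5 cm corresponds to roughly 2.1 times the expected number of involved nodes compared with the reference category, holding the other covariates fixed.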
Discussion: The most significant prognostic indicator for patients with early-stage breast
cancer is the presence or absence of axillary lymph node involvement. The number of involved
nodes is among the most important prognostic factors for breast cancer (Hernandez Avila et al.,
2006); it is an important indicator of the severity and prognosis of the disease. Although it
is an important prognostic factor, it is not necessarily associated with the stage of cancer:
patients with the same number of involved lymph nodes may be at different stages, and patients
with more involved lymph nodes are not necessarily at a more advanced stage (Yafume et al.,
1993). Clinicians need to predict the number of involved nodes in breast cancer patients in
order to improve health outcomes. Many studies have examined the status of lymph nodes in
breast cancer patients (Cianfrocca et al., 2004; Yiangou et al., 1999), but only a few have
described the number of involved nodes (Dwivedi et al., 2010; Kendal, 2006). Since the number
of nodes is count data, the negative binomial and Poisson distributions can be used to describe
it. Wayes S. Kendall (2006) illustrated that within any particular individual the number of
involved lymph nodes would be randomly (Poisson) distributed, whereas within the population as
a whole the number of involved nodes would be distributed in accordance with a negative
binomial distribution. Until now, the negative binomial model has been extensively used to
describe this distribution, assuming that over-dispersion is due only to unobserved
heterogeneity, although alternative models such as Poisson regression models have also been
considered for over-dispersed counts. A negative binomial model has been found to describe the
number of involved nodes better than the Poisson model because of excess variability
(over-dispersion) (Hung et al., 2008). In this study we describe the distribution of involved
lymph nodes using Poisson and negative binomial regression under a Bayesian approach, placing a
normal prior on the regression coefficients and obtaining their posterior distribution under
both models. We found that negative binomial regression explains the distribution of the number
of lymph nodes better than the Poisson regression model, which matches the findings of Hung et
al. (2008) and Kendall (2007).
The results show that tumor size, tumor grade and CA15-3 values were significant in both
models. In addition, ER status was significant in the Poisson regression model, and PR status
and HN2 status were significant in the negative binomial regression model.