bnlearn package

The bnlearn package is one of the most complete and popular packages for Bayesian Networks available to date in R.
First we should install and load the package. Please make sure that you download or compile at least bnlearn 4.2.x, as previous versions might have some bugs. Check the version of the bnlearn package by typing packageVersion("bnlearn") in the R console.
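For example (the version string printed will depend on your installation):

# check the installed version of bnlearn (should be at least 4.2.x)
packageVersion("bnlearn")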
You should check out the documentation that is available online [Link]. It contains nice examples and additional use cases beyond this tutorial. If you are interested in this topic, there are a couple of related books written by the author of the package, you can find the references in the mentioned website.
Install the bnlearn package

Copy and paste the following code (click the code button to show the hidden code) at the console prompt, after >, then press “enter”. If R asks Update all/some/none? [a/s/n]:, reply n, then press enter.

If you see [1] "Congratulations, the package bnlearn is correctly installed.", the package bnlearn is correctly installed.

if(require(bnlearn, quietly = TRUE) && require(Rgraphviz, quietly = TRUE)){
print("Congratulations, the package bnlearn is correctly installed.")
}else{
source('http://bioconductor.org/biocLite.R')
biocLite('Rgraphviz')
biocLite("RBGL")
if(!require(bnlearn)) install.packages("bnlearn")
if(require(bnlearn) && require(Rgraphviz)){
print("Congratulations, the package bnlearn is correctly installed.")
}else print("Contact us !!!")
}
#> [1] "Congratulations, the package bnlearn is correctly installed."
The hands-on workshop is based on the data set asia
, a small synthetic data set from Lauritzen and Spiegelhalter (1988) about lung diseases (tuberculosis, lung cancer or bronchitis) and visits to Asia.
Source
Lauritzen S, Spiegelhalter D (1988). “Local Computation with Probabilities on Graphical Structures and their Application to Expert Systems (with discussion)”. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 50(2), 157-224.
Load the bnlearn package

Type the following code at the console prompt, after >, then press “enter” (we strongly advise you to type the code rather than copy-pasting it).

# load the package
library(bnlearn)
# list objects in the working environment
ls()
#> character(0)
# Import the data set called asia
data(asia)
# list the variables (column names) in asia
# NOT RUN : colnames(asia)
# Delete the column called "E"
asia$E = NULL
# Verify that the data set "asia" now appears in the working environment
ls()
#> [1] "asia"
Format
The asia data set contains the following variables:
D (dyspnoea), a two-level factor with levels yes and no.
T (tuberculosis), a two-level factor with levels yes and no.
L (lung cancer), a two-level factor with levels yes and no.
B (bronchitis), a two-level factor with levels yes and no.
A (visit to Asia), a two-level factor with levels yes and no.
S (smoking), a two-level factor with levels yes and no.
X (chest X-ray), a two-level factor with levels yes and no.
Usage

Explore the data set asia using some basic R operations such as head(), nrow() and summary().

# print data set
# NOT RUN: asia
# You may want to run asia. What does it give?
# The first five lines should be enough to get a very first idea of the data set asia
# Verify the format of data : variable names and types.
head(asia, n = 5)
#> A S T L B X D
#> 1 no yes no no yes no yes
#> 2 no yes no no no no no
#> 3 no no yes no no yes yes
#> 4 no no no no yes no yes
#> 5 no no no no no no yes
# What is the sample size, i.e. how many individuals are there?
nrow(asia)
#> [1] 5000
# summarise the data set with the built-in function summary()
summary(asia)
#> A S T L B X
#> no :4958 no :2485 no :4956 no :4670 no :2451 no :4431
#> yes: 42 yes:2515 yes: 44 yes: 330 yes:2549 yes: 569
#> D
#> no :2650
#> yes:2350
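As a quick extra check (useful later, when we apply the law of total probability), the empirical marginal distribution of Smoking can be obtained with prop.table():

# empirical marginal distribution of Smoking: 2485/5000 and 2515/5000
prop.table(table(asia$S))
#>    no   yes 
#> 0.497 0.503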
There are different ways to manually initialise and modify the graph of a Bayesian Network. We illustrate in this hands-on session one of the options.
Each node is written between square brackets, as [<var_name>]. If the node has a parent set, we denote it by | and specify the list of parents separated by colons :. We can convert the formula into a bn object with model2network() (see the example below).

model2network() function

# manually define a BN structure using the model2network() function
dag_example = model2network("[Earthquake][Alert|Earthquake][Anxiety|Alert:Earthquake]")
# Plot the BN structure with graphviz.plot()
graphviz.plot(dag_example, sub = "Example DAG")
Remark
# the order doesn't matter : the three definitions below are equivalent
model2network("[Earthquake][Alert|Earthquake][Anxiety|Alert:Earthquake]")
model2network("[Earthquake][Alert|Earthquake][Anxiety|Earthquake:Alert]")
model2network("[Earthquake][Anxiety|Alert:Earthquake][Alert|Earthquake]")
Now, it’s your turn:

Suppose we use the following variable names:

A: Visit to Asia
B: Bronchitis
D: Dyspnea
L: Lung Cancer
T: Tuberculosis
S: Smoking History
X: Chest X-ray

Define a BN structure (a DAG) representing the causal relationships among these variables based on your knowledge:

Which causal relationships would you expect among the variables of the asia data set?
Define in R your BN structure with model2network() and assign the output of model2network(), your BN structure, to dag_mydag (see the example above).
Plot it with graphviz.plot().

The following code shows one possible solution. Other BN structures are also acceptable.
# Create and plot the network structure.
dag_mydag = model2network("[A][S][T|S][L][B|S][D|B][X|L:T]")
# Plot
graphviz.plot(dag_mydag)
Source: This exercise is taken from [LINK]
Consider the following piece of medical knowledge taken from [Lauritzen and Spiegelhalter, 1988]:
Create again a DAG representing the causal relationships among these variables, based on the above medical knowledge:

Define the DAG with model2network().
Plot it with graphviz.plot().

# create Lauritzen's BN structure
dag_Laur = model2network("[A][S][T|A][L|S][B|S][D|B:T:L][X|T:L]")
dag_Laur
#>
#> Random/Generated Bayesian network
#>
#> model:
#> [A][S][B|S][L|S][T|A][D|B:L:T][X|L:T]
#> nodes: 7
#> arcs: 8
#> undirected arcs: 0
#> directed arcs: 8
#> average markov blanket size: 3.14
#> average neighbourhood size: 2.29
#> average branching factor: 1.14
#>
#> generation algorithm: Empty
Remark: the generation algorithm is Empty, since this network was specified manually rather than learned from data!
par(mfrow = c(1,2))
graphviz.plot(dag_Laur, sub = "Lauritzen's DAG")
graphviz.plot(dag_mydag, sub = "My DAG")
Learn a BN structure from the asia data set

hc(asia) allows you to do this (it applies the Hill-Climbing algorithm, which is a score-based method). Assign the output of hc(asia) to dag_learn_hc.

dag_learn_hc = hc(asia)
dag_learn_hc
#>
#> Bayesian network learned via Score-based methods
#>
#> model:
#> [A][S][T][L|S][B|S][X|T:L][D|T:L:B]
#> nodes: 7
#> arcs: 7
#> undirected arcs: 0
#> directed arcs: 7
#> average markov blanket size: 2.86
#> average neighbourhood size: 2.00
#> average branching factor: 1.00
#>
#> learning algorithm: Hill-Climbing
#> score: BIC (disc.)
#> penalization coefficient: 4.258597
#> tests used in the learning procedure: 63
#> optimized: TRUE
Remark: this time the algorithm field is NOT empty; the output reports the learning algorithm (Hill-Climbing)!
graphviz.plot(dag_learn_hc)
compare() function

The compare() function compares two Bayesian network structures by returning three elements, the:

true positive (tp) arcs, which appear both in target and in current;
false positive (fp) arcs, which appear in current but not in target;
false negative (fn) arcs, which appear in target but not in current.

By default these are reported as counts; with arcs = TRUE the arcs themselves are listed.

compare(target = dag_Laur, current = dag_mydag, arcs = TRUE)
#> $tp
#> from to
#> [1,] "S" "B"
#> [2,] "B" "D"
#> [3,] "T" "X"
#> [4,] "L" "X"
#>
#> $fp
#> from to
#> [1,] "S" "T"
#>
#> $fn
#> from to
#> [1,] "A" "T"
#> [2,] "S" "L"
#> [3,] "T" "D"
#> [4,] "L" "D"
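For reference, calling compare() without arcs = TRUE returns only the counts; given the arc lists above, they are:

compare(target = dag_Laur, current = dag_mydag)
#> $tp
#> [1] 4
#> 
#> $fp
#> [1] 1
#> 
#> $fn
#> [1] 4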
graphviz.compare() function

Another approach to comparing network structures is to plot them side by side and highlight differences with different colours. graphviz.compare() does that using the Rgraphviz package, while taking care that the nodes have the same position in all networks, to make it easier to spot which arcs are different. As we can see below, and unlike any of the functions we covered above, graphviz.compare() can take more than two network structures as arguments.
par(mfrow = c(1, 3))
graphviz.plot(dag_Laur, sub = "dag_Laur")
graphviz.plot(dag_mydag, sub = "dag_mydag")
graphviz.plot(dag_learn_hc, sub = "dag_learn_hc")
par(mfrow = c(1, 3))
graphviz.compare(dag_Laur, dag_mydag, dag_learn_hc)
score() function

We can compute the network score of a particular graph for a particular data set with the score() function (manual).

Remark: A function called score() may also be defined in other R packages, so you may need to specify from which package you want to call it. That is, if there is an error when you call the score() function, try to replace score() by bnlearn::score() (this may solve the problem), so that R knows you want the score() function from the bnlearn package.
Compute the network score of each of the three structures for the asia data set:
score(x = dag_mydag, data = asia)
score(x = dag_Laur, data = asia)
score(x = dag_learn_hc, data = asia)
# if it does not work, try bnlearn::score()
# bnlearn::score(x = dag_mydag, data = asia)
# bnlearn::score(x = dag_Laur, data = asia)
# bnlearn::score(x = dag_learn_hc, data = asia)
Remark: The scores of dag_Laur
and dag_learn_hc
are almost the same.
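Note that score() supports several network scores through its type argument; for instance, for discrete data:

# other scores for discrete networks, selected via the type argument
score(x = dag_learn_hc, data = asia, type = "aic")   # AIC
score(x = dag_learn_hc, data = asia, type = "bde")   # Bayesian Dirichlet equivalent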
In general, there are three ways of creating a bn.fit object representing a Bayesian network:

from a data set and a network structure (in a bn object) with bn.fit(), as illustrated here;
by specifying all the local distributions by hand (with custom.fit());
by creating the object from data and then modifying the local distributions of interest.

We will talk about the first and the third.
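For completeness, here is a minimal sketch of the second option with custom.fit(), on a hypothetical two-node network (the names net, cptS and cptL are ours, not part of the asia exercise):

# hand-specify all CPTs for a tiny hypothetical network
net = model2network("[S][L|S]")
cptS = matrix(c(0.5, 0.5), ncol = 2, dimnames = list(NULL, c("no", "yes")))
cptL = matrix(c(0.99, 0.01, 0.90, 0.10), ncol = 2,
              dimnames = list(L = c("no", "yes"), S = c("no", "yes")))
custom_bn = custom.fit(net, dist = list(S = cptS, L = cptL))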
Once the structure of a DAG has been determined, the parameters can be estimated as well. The two most common approaches are maximum likelihood estimation (method = "mle") and Bayesian estimation (method = "bayes"). In both cases, parameter estimation is only based on the subset of data spanning the considered variable and its parents.

When using bnlearn, parameter learning is performed by the bn.fit() function, which takes the network structure and the data as arguments.
Use the asia data set to learn three sets of parameters, one for each of the BN structures above (bn.fit(x = one_of_BN_Structures_of_your_choice, data = asia, method = "mle")).

Then use the output of the bn.fit() function to find the conditional probability table (CPT) of Lung Cancer given Smoking History, \(\Pr(L\,|\,S) = ?\), and compute

\[\frac{\Pr(L = \text{yes} \,|\, S = \text{yes})}{\Pr(L = \text{yes} \,|\, S = \text{no})} = ?\]
Remark: As you are not given any example showing how to use bn.fit(), simply type ?bn.fit and try to figure it out by yourself. Otherwise, feel free to check the solution below.
# By default bn.fit() uses the mle method:
# bn.fit(x, data) == bn.fit(x, data, method = "mle")
bn_mydag = bn.fit(x = dag_mydag, data = asia, method = "mle")
bn_dag_Laur = bn.fit(x = dag_Laur, data = asia, method = "mle")
bn_dag_learn_hc = bn.fit(x = dag_learn_hc, data = asia, method = "mle")
# Optional
# bn_mydag_bayes = bn.fit(x = dag_mydag, data = asia, method = "bayes")
# bn_dag_Laur_bayes = bn.fit(x = dag_Laur, data = asia, method = "bayes")
# bn_dag_learn_hc_bayes = bn.fit(x = dag_learn_hc, data = asia, method = "bayes")
Use the output of the bn.fit() function to find the conditional probability table (CPT) of Lung Cancer given Smoking History: \(\Pr(L\,|\,S) = ?\)

# You can print the CPTs attached to all nodes in a given BN structure (for example bn_dag_Laur) using: `bn_dag_Laur`
# But the list of CPTs might be too long
# print just the CPT of Lung Cancer with respect to Smoking History
coef(bn_dag_Laur$L) # or simply run : bn_dag_Laur$L
#> S
#> L no yes
#> no 0.98631791 0.88230616
#> yes 0.01368209 0.11769384
# you can also represent a CPT with barchart.
bn.fit.barchart(bn_dag_Laur$L)
\[\frac{\Pr(L = \text{yes} \,|\, S = \text{yes})}{\Pr(L = \text{yes} \,|\, S = \text{no})} = \frac{0.11769384}{0.01368209} = 8.6020348\]
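If you prefer dot plots, the companion function bn.fit.dotplot() displays the same CPT:

# the same CPT, shown as a dot plot
bn.fit.dotplot(bn_dag_Laur$L)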
Specifying all the local distributions can be problematic in a large network. In most cases it can be more convenient to create a bn.fit
object from (possibly dummy) data using bn.fit()
, and then to modify just the local distributions of the nodes of interest.
To do that, we extract the local distribution of a node from the bn.fit object with coef(), modify it, and re-save it.

R Code

dag = dag_Laur
fitted = bn.fit(dag, asia)
cpt = coef(fitted$S)
# Imagine the distribution of Smoking history in a population is "no": 70% , and "yes" : 30%
cpt[1:2] = c(0.7, 0.3) # The probability distribution of node S must sum to one.
fitted$S = cpt
fitted
#>
#> Bayesian network parameters
#>
#> Parameters of node A (multinomial distribution)
#>
#> Conditional probability table:
#> no yes
#> 0.9916 0.0084
#>
#> Parameters of node B (multinomial distribution)
#>
#> Conditional probability table:
#>
#> S
#> B no yes
#> no 0.7006036 0.2823062
#> yes 0.2993964 0.7176938
#>
#> Parameters of node D (multinomial distribution)
#>
#> Conditional probability table:
#>
#> , , L = no, T = no
#>
#> B
#> D no yes
#> no 0.90017286 0.21373057
#> yes 0.09982714 0.78626943
#>
#> , , L = yes, T = no
#>
#> B
#> D no yes
#> no 0.29752066 0.13170732
#> yes 0.70247934 0.86829268
#>
#> , , L = no, T = yes
#>
#> B
#> D no yes
#> no 0.12500000 0.29166667
#> yes 0.87500000 0.70833333
#>
#> , , L = yes, T = yes
#>
#> B
#> D no yes
#> no 0.00000000
#> yes 1.00000000
#>
#>
#> Parameters of node L (multinomial distribution)
#>
#> Conditional probability table:
#>
#> S
#> L no yes
#> no 0.98631791 0.88230616
#> yes 0.01368209 0.11769384
#>
#> Parameters of node S (multinomial distribution)
#>
#> Conditional probability table:
#> no yes
#> 0.7 0.3
#>
#> Parameters of node T (multinomial distribution)
#>
#> Conditional probability table:
#>
#> A
#> T no yes
#> no 0.991528842 0.952380952
#> yes 0.008471158 0.047619048
#>
#> Parameters of node X (multinomial distribution)
#>
#> Conditional probability table:
#>
#> , , T = no
#>
#> L
#> X no yes
#> no 0.956587473 0.006134969
#> yes 0.043412527 0.993865031
#>
#> , , T = yes
#>
#> L
#> X no yes
#> no 0.000000000 0.000000000
#> yes 1.000000000 1.000000000
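To confirm that the modified distribution was stored, you can read it back with coef() (the output follows directly from the assignment above):

# read the modified CPT back: it should now be 0.7 / 0.3
coef(fitted$S)
#>  no yes 
#> 0.7 0.3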
Inference and probability queries
The inference engine of bnlearn
is limited, but it can be used to test the networks and to perform basic operations with the models.
By using cpquery() we can ask for the probability of an event given a set of evidence. We may ask for a particular combination of configurations in the BN and a set of observed statuses for the variables in the evidence. For example, in the asia network, we could ask: what is the probability of a positive cancer diagnosis for a person that smokes?
Reference : http://jarias.es/bayesnetRtutorial/#inference_and_probability_queries
We may also be interested in finding the configuration of the variables that has the highest posterior probability (in the discrete case). Two functions are useful for such queries:

cpquery() estimates the conditional probability of an event given evidence.
cpdist() generates random observations conditional on the evidence.
Remark
Note that both cpquery()
and cpdist()
are based on Monte Carlo particle filters, and therefore they may return slightly different values on different runs.
cpquery()

Prior probability of Dyspnea being yes
# cpquery(fitted, event, evidence, cluster = NULL, method = "ls", ..., debug = FALSE)
set.seed(20180308)
# Prior distribution of Dyspnoea
cpquery(fitted = bn_dag_Laur,
event = (D == "yes"),
evidence = TRUE, # evidence being true means that we have no prior evidence.
n = 100000)
#> [1] 0.46936
Posterior probability of Dyspnea = yes given Lung cancer being no
# Posterior distribution of Dyspnoea given Lung cancer being "no"
cpquery(fitted = bn_dag_Laur,
event = (D == "yes"),
evidence = (L == "no"),
n = 100000)
#> [1] 0.4436559
Posterior probability of Dyspnea = yes given Lung cancer being yes
# Posterior distribution of Dyspnoea given Lung cancer being "yes"
cpquery(fitted = bn_dag_Laur,
event = (D == "yes"),
evidence = (L == "yes"),
n = 100000)
#> [1] 0.8201835
cpdist()
# cpdist( fitted, nodes, evidence, cluster = NULL, method = "ls", ..., debug = FALSE)
set.seed(20180308)
prop.table(
table(
cpdist(fitted = bn_dag_Laur,
nodes = c("D"),
evidence = TRUE,
n = 100000))
)
#>
#> no yes
#> 0.5304898 0.4695102
prop.table(
table(
cpdist(fitted = bn_dag_Laur,
nodes = c("D"),
evidence = (L == "no"),
n = 100000))
)
#>
#> no yes
#> 0.5563441 0.4436559
prop.table(
table(
cpdist(fitted = bn_dag_Laur,
nodes = c("D"),
evidence = (L == "yes"),
n = 100000))
)
#>
#> no yes
#> 0.1765428 0.8234572
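Besides the default logic sampling (method = "ls"), cpdist() also supports likelihood weighting, where the evidence is passed as a named list; a minimal sketch:

# likelihood weighting: the evidence is a named list of fixed values
head(cpdist(fitted = bn_dag_Laur, nodes = "D",
            evidence = list(L = "yes"), method = "lw", n = 10^4))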
Estimating probabilities with cpquery()

Estimate the prior probability \(\Pr(L = \text{yes})\) and the ratio \(\Pr(L = \text{yes}\,|\,S = \text{yes}) / \Pr(L = \text{yes}\,|\,S = \text{no})\) with cpquery(), and compare the results with the exact values computed from the CPTs (coef(bn_dag_Laur$L) and coef(bn_dag_Laur$S)).

Recall: Law of total probability

\[\Pr(L = \text{yes}) = \Pr(L = \text{yes}\,|\,S = \text{yes}) \times \Pr(S = \text{yes}) + \Pr(L = \text{yes}\,|\,S = \text{no}) \times \Pr(S = \text{no})\]
coef(bn_dag_Laur$L)
coef(bn_dag_Laur$S)
0.11769384 * 0.503 + 0.01368209 * 0.497 # sum(coef(bn_dag_Laur$L) [2,] * coef(bn_dag_Laur$S))
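Note that 0.11769384 × 0.503 + 0.01368209 × 0.497 ≈ 0.066, which matches the cpquery() estimate below.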
Now estimate the same quantities with cpquery():
set.seed(20180308)
L_prior = cpquery(fitted = bn_dag_Laur,
event = (L == "yes"),
evidence = TRUE, # evidence being true means that we have no prior evidence.
n = 100000)
L_prior
#> [1] 0.06608
L_S_yes = cpquery(fitted = bn_dag_Laur,
event = (L == "yes"),
evidence = (S == "yes"),
n = 100000)
L_S_yes
#> [1] 0.1177547
L_S_no = cpquery(fitted = bn_dag_Laur,
event = (L == "yes"),
evidence = (S == "no"),
n = 100000)
L_S_no
#> [1] 0.01359477
L_ratio = L_S_yes / L_S_no
L_ratio
#> [1] 8.661764
A final remark on approximate probabilities
Source: https://www.norsys.com/tutorials/netica/secA/tut_A1.htm
Some people have shied away from using Bayes nets because they imagine they will only work well if the probabilities upon which they are based are exact. This is not true. It turns out very often that approximate probabilities, even subjective ones that are guessed at, give very good results. Bayes nets are generally quite robust to imperfect knowledge. Often the combination of several strands of imperfect knowledge can allow us to make surprisingly strong conclusions.
Studies have shown people are better at estimating probabilities “in the forward direction”. For example, doctors are quite good at giving the probability estimates for “if the patient has lung cancer, what are the chances their X-ray will be abnormal?”, rather than the reverse, “if the X-ray is abnormal, what are the chances of lung cancer being the cause?” (Jensen, Finn V. (1996) An Introduction to Bayesian Networks, Springer Verlag, New York.)