*A short time ago, after reading the excellent article by Albert Wright and following the recent publication of the full results of the Golden study, I was able to point out that the reported histological scores may also include false negatives and false positives.*

*Using a simplified model that closely mimics the Golden study, I will demonstrate in this article that “placebo false positives” and “treatment false negatives” are the direct random result of the combination of three factors: (i) threshold (cut-off) effects of the selection criteria, (ii) biopsy sampling variation, where the single biopsy is not representative of the whole liver, and (iii) the binary histological end-point, which fails to take into account the global progression or regression of the disease. These false results, which significantly distort and confuse the outcome, are entirely the result of the trial design and are independent of the efficacy of the drug used, Elafibranor.*

*I will try to explain, with the strong help of Albert Wright, why the histological 'endpoints' of clinical studies in NASH should not be considered in isolation, without looking for coherence with all the other data now available from the clinical studies. Treating the histological data in isolation leads to the wrong conclusions, both qualitatively and quantitatively.*

THE LIMITS OF LIVER BIOPSY IN NASH

The gold standard for the diagnosis of NASH is histological analysis of a liver biopsy. The main clinical studies, the completed Phase 2b and the ongoing Phase 3, therefore include at least two liver biopsies, one before and one after treatment.

A standard liver biopsy is made by passing a needle through the abdomen or via a vein into the liver to extract a sample.

The needle extracts a sample of tissue with a length between 10 and 30 mm, and a diameter between 0.9 and 2.1 mm.

For diagnosing fibrosis, the sample should be longer than 25 mm.

BIOPSY SAMPLING VARIATION AND THRESHOLD EFFECTS

We can estimate that an ideal biopsy for histological analysis is at least 1.6 mm in diameter and 25 mm in length, with an average weight of 30 mg. The average weight of a human liver is about 1.5 kg, so a liver biopsy sample corresponds to a single **random volume** of about 1/50,000th of its mass. This quantity is sufficient for a histological analysis, but the question remains whether such a small and unique sample is representative of the liver as a whole.
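The arithmetic behind the 1/50,000 figure can be checked in a few lines. This is only a sketch using the two averages quoted above (30 mg per biopsy, 1.5 kg per liver); the function and constant names are illustrative.

```python
# Figures taken from the text: average biopsy weight and average liver weight.
BIOPSY_MASS_MG = 30.0
LIVER_MASS_KG = 1.5


def sampled_fraction(biopsy_mg: float, liver_kg: float) -> float:
    """Return the fraction of the liver mass contained in one biopsy."""
    liver_mg = liver_kg * 1_000_000  # 1 kg = 1,000,000 mg
    return biopsy_mg / liver_mg


fraction = sampled_fraction(BIOPSY_MASS_MG, LIVER_MASS_KG)
print(f"One biopsy samples about 1 part in {1 / fraction:,.0f} of the liver")
```

A thinner or shorter biopsy than the "ideal" one would sample an even smaller fraction.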

**In the vast majority of analytical techniques, the sampling process is taken very seriously, especially if the mass to be evaluated is heterogeneous. To assess the mineral content of a mineral resource, several long cores are drilled out and systematically sampled at different levels to obtain an average value of the mineral content throughout the volume. To check for prostate cancer, some 20 biopsies are taken, systematically spread throughout the organ, so as to avoid missing cancerous cells, which would lead to a “false negative” result with obvious consequences. Under ideal circumstances, to eliminate sampling variation, a single sample should only be taken if the sample volume is homogeneous (as with a blood sample) or can be made so by an adequate mixing process such as stirring a liquid.**

**The question we have to ask is therefore the following: is a diseased liver sufficiently homogeneous that taking a single sample of 1/50,000 of its mass can be trusted to eliminate sampling variation and hence provide reliable quantitative results for use in a clinical trial?**

This question was answered in an article published in 2005 by Vlad RATZIU *et al*, in which two biopsies were taken from 51 patients with NASH or NAFLD.

http://www.gastrojournal.org/article/S0016-5085(05)00630-X/pdf

The results and conclusions were very clear :

**Results:** No features displayed high agreement; substantial agreement was only seen for steatosis grade; moderate agreement for hepatocyte ballooning and perisinusoidal fibrosis; fair agreement for Mallory bodies; acidophilic bodies and lobular inflammation displayed only slight agreement.

Overall, the discordance rate for the presence of hepatocyte ballooning was 18%, and ballooning would have been missed in 24% of patients had only 1 biopsy been performed. **The negative predictive value of a single biopsy for the diagnosis of NASH was at best 0.74. Discordance of 1 stage or more was 41%.**

**Conclusions:** Histologic lesions of NASH are unevenly distributed throughout the liver parenchyma; therefore, **sampling error of liver biopsy can result in substantial misdiagnosis and staging inaccuracies**.

A SIMPLE MODEL TO MIMIC NASH SAMPLING

In this simple model we consider the liver as a black bag containing 50,000 coloured balls ranging from white to dark red depending on the stage of the disease. These are the potential biopsy samples.

A healthy liver would contain only white balls, and a liver at the terminal cirrhosis stage would contain only dark red balls. Vermilion corresponds to established NASH.

Imagine that we choose to carry out a liver biopsy on a patient with NASH. This corresponds to blindly pulling out one of 50,000 balls from the bag and looking at its colour. We will have a ball with a colour ranging from white to dark red.

But the consensus definition of NASH arbitrarily works with a threshold: in our model, vermilion or darker.

All balls paler than vermilion are not regarded as NASH, whereas all balls of vermilion or darker are proof of NASH. The end-point of the clinical study is binary: paler than vermilion = no NASH, vermilion or darker = NASH. It is also unique for each patient: you can only pick one ball per patient.

Because of the binary nature of the end-point, we can simplify the model without changing the outcome. All the balls paler than vermilion can be painted white and all those vermilion or darker can be painted red. The outcome is exactly the same.

Then, we would have a bag of 50,000 balls of only two colours, white and red.

White = no NASH,

red = NASH.

**NOW,
if we draw a ball at random, can we honestly judge that the result obtained is truly
representative of the stage of the NASH ?**

Well, it could depend on the severity of the disease :

If we draw a single white ball without having any other information, we cannot deduce anything specific about the severity of the patient's NASH. We cannot affirm that the patient does not have NASH, because we have only sampled 1/50,000 of the liver and we know nothing about its homogeneity. The statistics are simply loaded against a reliable quantitative result.

If a red ball is drawn in isolation, it can be inferred that the patient does indeed have 'NASH' in this small sample but we cannot know how much of the liver is affected.

You now can see that drawing a white ball can easily lead to a false negative result, whereas drawing a red ball gives a true positive but gives no information about how much of the liver is affected.

This method can only be reliable if we significantly increase the number of samples and locations in the liver, or if we take bigger samples that are more representative of the whole liver (equivalent to having a smaller number of coloured balls of a much larger size). Compare this with prostate cancer biopsies, where 20 samples are taken from an organ 30 times smaller so as to reduce the possibility of false negatives.
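To see how much extra samples would help, here is a minimal sketch of the chance of missing the disease entirely when several balls are drawn. It assumes that the draws are independent and that the bag is large enough for the red fraction to stay the same between draws; both are simplifications added for illustration, and the function name is hypothetical.

```python
def prob_all_white(red_fraction: float, n_draws: int) -> float:
    """Chance that every one of n independent draws is white (a false negative)."""
    return (1.0 - red_fraction) ** n_draws


# A mildly affected bag: 25% red balls.
for n in (1, 2, 5, 20):
    print(f"{n:2d} draw(s): {prob_all_white(0.25, n):.2%} chance of seeing no red ball")
```

Under these assumptions the false-negative risk falls quickly as the number of samples grows, which is the logic behind taking some 20 prostate biopsies.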

The reliability of the method increases at both extremes of the spectrum of the disease.

If the bag is full or nearly full of white balls, then the probability of drawing a white ball consistent with the stage of the disease is also higher. Similarly, if the bag is full or nearly full of red balls, then we have a higher probability of drawing a red ball consistent with the disease.

In the middle range, we have a 50% probability of drawing either red or white and equally a 50% chance of error.
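The way reliability changes across the spectrum can be sketched as follows. One assumption is added here that is not part of the model above: the "true" state of the bag is taken to be its majority colour, so a draw is misleading when it shows the minority colour.

```python
def single_draw_error(red_fraction: float) -> float:
    """Chance that one random ball disagrees with the majority colour.

    Assumption (illustrative only): the true status of the bag is
    defined as its majority colour.
    """
    return min(red_fraction, 1.0 - red_fraction)


for red in (0.05, 0.25, 0.50, 0.75, 0.95):
    print(f"{red:.0%} red balls: {single_draw_error(red):.0%} chance of a misleading draw")
```

The error is smallest at both extremes and peaks at 50% in the middle range.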

To further complicate matters, a clinical study is carried out in two runs, before and after treatment, so the sampling errors of the two runs add up. Except for the extreme cases, the sampling errors make the exercise look like a lottery.

THRESHOLD EFFECT OF SELECTION CRITERIA

**To
comply with the selection criteria, patients selected for the trial must have
histological NASH. This is equivalent to drawing a red ball.**

In our model we treat 50% of patients with placebo and 50% with the drug.

At the end of the study we make another biopsy. We blindly draw a single ball again.

If we draw a white ball the patient is considered cured, and if we draw a red ball the patient still has NASH.

I guess you are now beginning to understand the problem!

The Placebo Effect

Consider a patient who has a mild stage of the disease, say 25% of red balls in the bag. This patient was recruited into the study because he had drawn a red ball before treatment even though we had only a 25% chance of finding the patient sick. That is the luck of the draw. The biopsy was taken from a zone where the NASH was strong, he drew a red ball. Now there's the second case of bad luck. This patient had a 50/50 chance of falling into the placebo group. In the double blind method, this patient drew the placebo and received no treatment for the disease.

At the end of the study we made a second draw and this time, because the disease had not evolved, there were still 75% of white balls in the bag. There was therefore a 75% chance of drawing a white ball and concluding that the patient had been cured!!!

This is the statistical chance of obtaining a false positive with a placebo under these conditions. There was a 25% chance of being selected as sick on the first draw. However, once selected, this patient is considered to be just as sick as any other in the study. At the end, this patient has a 75% chance of being considered cured even when on placebo. **This is the problem when only mildly sick patients are included in a study in which the end-point has a significant sampling or analytical error. You can now see that this result has absolutely nothing to do with the efficacy of the drug being tested. It's all about how sampling errors impact on the two threshold effects of inclusion and end-point. Nothing else.**

On a patient who had a fairly advanced disease, say 75% red balls in the bag, this patient has also been retained in the study because he drew a red ball before the study.

We had a 75% chance of actually including a truly sick patient according to the definition of the trial.

If the patient was on placebo and nothing changed in his state during the study, at the end of the trial we still have a 25% chance of finding the patient cured even though nothing really happened !!!

**So in the placebo arm, the probability of false positives increases strongly when mildly affected patients are included. This is purely a statistical effect and the patients were not miraculously cured by themselves. This bias could have been more easily identified if we had included mildly sick patients (25% red balls) who drew a white ball on the first draw. If we had included such patients, then by the same statistics we would have 25% of such patients declared sick after treatment with placebo even though they were declared healthy on the first draw. Here the selection threshold is also the cause of the problem.**

**The
distortion is due to the fact that we work on a population that was
statistically unlikely to be included in the study (25%), but which, once
included by chance as being sick, had a 75% chance of being declared cured
without treatment at the end.**
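The placebo scenario can also be simulated directly. The sketch below is a simple Monte Carlo run under the assumptions of the model: every candidate has the same fixed red fraction, nothing changes on placebo, and each biopsy is one independent random draw. The function name, the number of candidates and the seed are arbitrary choices made for illustration.

```python
import random


def simulate_placebo_arm(red_fraction: float, n_candidates: int, seed: int = 0):
    """Return (share of candidates enrolled, share of enrolled declared cured)."""
    rng = random.Random(seed)
    enrolled = 0
    declared_cured = 0
    for _ in range(n_candidates):
        # Screening biopsy: only a red ball gets the patient into the trial.
        if rng.random() >= red_fraction:
            continue
        enrolled += 1
        # End-of-study biopsy on placebo: the bag has not changed.
        if rng.random() >= red_fraction:
            declared_cured += 1  # white ball: counted as a cure
    if enrolled == 0:
        raise ValueError("no patient was enrolled; increase n_candidates")
    return enrolled / n_candidates, declared_cured / enrolled


enrol_rate, cure_rate = simulate_placebo_arm(0.25, 100_000)
print(f"enrolled: {enrol_rate:.1%}, declared cured on placebo: {cure_rate:.1%}")
```

With 25% red balls, the simulated rates should land close to the 25% inclusion chance and the 75% apparent cure rate discussed above.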

**Placebo Effect vs. Drug Effect (mildly sick patients, 25% red)**

For
patients receiving the drug, we assume the same sampling statistics (and assume
that the drug is effective). Let us take the case where the **drug actually
causes regression of the disease by 50%.**

On a patient who had a mild case of the disease, say 25% red balls in the bag, there was a 25% chance of him being selected as sick.

If the patient is in the treatment arm, at the end of the trial he has only 12.5% of red balls in his bag. On performing a second draw, there is an 87.5% chance of him being declared cured.

Declared rate of cure: **placebo 75%, drug 87.5%**

**So,
for patients who are only mildly sick, the histological method of measurement
using a binary threshold for both selection and end-point creates a strong
placebo effect. The effect of even a highly effective drug is barely
observable.**

**Placebo
Effect vs. Drug Effect (severely sick patients, 75% red)**

On a patient who had advanced disease, say 75% red balls in the bag, there was a 75% chance of him being selected as sick.

If the patient is in the treatment arm, at the end of the study the disease is reduced by 50%, so he now has only 37.5% of red balls in his bag. On the second biopsy, he has a 62.5% chance of drawing a white ball and being declared cured.

Declared rate of cure: **placebo 25%, drug 62.5%**

**So,
for patients who are severely sick, using the same histological method the
placebo effect is very much reduced and the efficacy of the drug becomes clear.**

**Summarising this important observation**

· **Patients with mild disease (25%) and 50% regression on drug**

*Probability of positive observation of cure by placebo: 75%*

*Probability of positive observation of cure by drug: 87.5%*

*87.5 / 75 = 1.17 observed efficacy ratio and 16.7% numerical difference.*

· **Patients with advanced disease (75%) and 50% regression on drug**

*Probability of positive observation of cure by placebo: 25%*

*Probability of positive observation of cure by drug: 62.5%*

*62.5 / 25 = 2.5 observed efficacy ratio and 150% numerical difference.*

**This histological measurement method is therefore highly dependent on the severity of the disease and unsuitable for patients with mild disease. This can be verified quickly if we look at the extremes.**

· **Patients with very mild disease (10%) and 50% regression on drug**

*Probability of positive observation of cure by placebo: 90%*

*Probability of positive observation of cure by drug: 95%*

*95 / 90 = 1.06 observed efficacy ratio and 5.6% numerical difference.*

· **Patients with very advanced disease (90%) and 50% regression on drug**

*Probability of positive observation of cure by placebo: 10%*

*Probability of positive observation of cure by drug: 55%*

*55 / 10 = 5.5 observed efficacy ratio and 450% numerical difference.*

If we draw a graph showing the discriminatory nature of the method according to the severity of the patient's disease at study entry, one can easily see this point.

*Graph title: discrimination ratio vs NASH severity*
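The points of such a graph can be recomputed from the model. In this sketch the declared cure rate is simply the share of white balls at the second draw, and the drug is assumed to remove 50% of the red balls, as in the examples above; the function name is illustrative.

```python
def declared_cure_rate(red_fraction: float, regression: float = 0.0) -> float:
    """Chance of drawing a white ball at the end of the study.

    regression is the share of red balls removed by treatment (0 for placebo).
    """
    remaining_red = red_fraction * (1.0 - regression)
    return 1.0 - remaining_red


for red in (0.10, 0.25, 0.50, 0.75, 0.90):
    placebo = declared_cure_rate(red)
    drug = declared_cure_rate(red, regression=0.5)
    print(f"{red:.0%} red: placebo {placebo:.1%}, drug {drug:.1%}, "
          f"ratio {drug / placebo:.2f}")
```

The ratio between drug and placebo grows steadily with the share of red balls at entry, so the same drug looks far more effective in severely sick patients.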

**So
how should we interpret the histological results?**

To limit the statistical effect of a histological sample of 1/50,000 of the liver mass and its binary interpretation of reversion, several other parameters are available to us in the study results.

A second analysis of the colour of the balls can be made, this time taking into account not 2 colours but the 8 colours of the NAS score.

A ball is drawn before treatment and another after treatment, and the intensity of the colour of each ball is compared. If the second ball is lighter than the first, one concludes that the patient is getting better.

A statistical approach can rapidly demonstrate that we are not really much better off than with the previous method. By picking at random only one ball out of 50,000 we cannot draw any definitive conclusions. The statistics are still stacked against a reliable outcome.

The biopsy sampling rules try to limit this lottery effect by requiring a minimum length of the biopsy, and by taking several readings on different parts of the sample. The problem remains that a single biopsy only provides a picture of a small part of the liver when we need to find a global average for the whole liver.

**This
is where serological markers come into play.**

Biopsy samples give us a physical and static picture of the liver in a specific small volume. Blood markers tell us about how the liver is functioning as a whole. They tell us about a specific biological liver function produced by the whole liver. Since they can be taken at regular intervals, they also tell us about the progression or regression of the health of the liver. These two methods are complementary.

Imagine the colour of the balls in the bag is water soluble. They don't just have a colour, they have a function. They interact with the water just as liver cells interact with the blood of the patient. Healthy cells clean up the blood (white), sick cells release toxins into the blood (red). If the bag is dipped in water for some time, the water will take on the average colour of the balls. The more red balls there are, the more the water will turn red. Looking at the intensity of the colour of the water, we can get a good idea of the proportion of red, pink and white balls. We get a quantitative value which can be accurately measured. This is our blood marker. The more strongly it is coloured, the sicker the patient.

If the water is very red, we deduce that there is a majority of very red balls in the bag. If the water is clear, we can deduce that there are few red balls. If it is pink, we will not know if it is the result of a few very red balls or a large number of pink balls.

But what is interesting is that we have an overall idea of the progress of the disease even though we don't know exactly what is happening in the bag. There can be a lot of pink balls but no red balls, and so no NASH, or there can be some red balls and therefore some NASH. The liquid marker has the advantage that it does not have a threshold effect.
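The difference between the two measurements can be sketched with two small bags. Colour intensity is coded here from 0.0 (white) to 1.0 (dark red), and the vermilion threshold is set at 0.5; both the scale and the threshold value are hypothetical choices made only for this illustration.

```python
from statistics import mean

VERMILION = 0.5  # hypothetical threshold on a 0.0 (white) to 1.0 (dark red) scale


def water_colour(bag: list[float]) -> float:
    """Liquid marker: the average colour intensity of all balls in the bag."""
    return mean(bag)


def red_fraction(bag: list[float]) -> float:
    """Share of balls that a biopsy would call NASH (vermilion or darker)."""
    return sum(1 for ball in bag if ball >= VERMILION) / len(bag)


few_dark_balls = [1.0] * 20 + [0.0] * 80  # a few very red balls
many_pink_balls = [0.2] * 100             # many pink balls, none red

for name, bag in (("few dark", few_dark_balls), ("many pink", many_pink_balls)):
    print(f"{name}: water colour {water_colour(bag):.2f}, "
          f"chance of a red biopsy {red_fraction(bag):.0%}")
```

Both bags give the same pink water, yet only one of them can ever yield a red ball. The liquid marker measures the global average, while the biopsy answers a threshold question on a single ball.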

Now, if we correlate the colour of the histological result with the colour of the liquid, we can eliminate some errors.

Initially all the patients had drawn a red ball (to be selected for the study).

On the second draw, after treatment, a white ball is considered as a cure (positive) and a red ball, a lack of cure (negative).

**Imagine
a patient under placebo**

The initial liquid was pink. One can then imagine that we are dealing with a mildly sick patient.

· If at the end of the study we draw a red ball and the liquid is still pink or a little redder, we deduce that this patient is not cured. **It's a true negative!**

· If at the end of the study we draw a white ball and the liquid is still pink or a little red, this patient, according to histological criteria, is considered to be completely cured, but our liquid marker says that this patient is still just as sick as before. **The lack of consistency between the two methods indicates that this is a false positive!**

**Now imagine a patient under treatment.**

Initially a red ball was drawn (to be selected for the study) and the liquid was dark red.

One can then deduce that we are dealing with a very sick patient.

- If, at the end of the study, we draw a red ball and the liquid is still dark red, we can deduce that this patient has not been cured. **It's a true negative!**

- If, at the end of the study, a white ball is drawn and the liquid is still red or dark red, this patient, according to histological criteria, is considered completely cured, but the liquid marker does not agree. It is clear that the functional state of the patient's liver has not improved. The treatment did not work. **This is a false positive!**

- If, at the end of the study, a white ball is drawn and the liquid is clear or pink, this patient is, according to histological criteria, considered cured, and the liquid marker confirms this. This patient is cured, **a true positive!**

- If, at the end of the study, a red ball is drawn and the liquid is clear or pink, according to histological criteria this patient is still considered sick, but the liquid marker shows an overall improvement in liver function. The patient is getting better but the biopsy is not representative of the overall state of the liver. **It's a false negative!**

**It
is therefore important to read the histological findings in the light of
serological results otherwise we risk misinterpreting the data with false
positives and false negatives.**
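The four cases above amount to a small decision table, sketched here as a function. Reducing the liquid marker to a simple yes/no "improved" flag is a simplification made for this example; the function and parameter names are illustrative.

```python
def classify(second_ball_is_white: bool, marker_improved: bool) -> str:
    """Cross-check the end-of-study biopsy against the liquid (blood) marker."""
    if second_ball_is_white and marker_improved:
        return "true positive"    # histology and marker agree: cured
    if second_ball_is_white and not marker_improved:
        return "false positive"   # lucky white ball, liver no better
    if not second_ball_is_white and marker_improved:
        return "false negative"   # unlucky red ball, liver improving
    return "true negative"        # both agree: not cured


for white in (True, False):
    for improved in (True, False):
        ball = "white" if white else "red"
        trend = "improved" if improved else "not improved"
        print(f"{ball} ball, marker {trend}: {classify(white, improved)}")
```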

**And that is why it is urgent to develop specific serological markers for the disease: biopsies are too unreliable because of random sampling errors and interpretation thresholds.**

_______________

HOW THE GOLDEN STUDY CAN BE SCREENED

Take the example of GENFIT's GOLDEN study. In the article published in Gastroenterology, a passage from the discussion merits attention:

*Histological
advantage of the 120 mg dose was reflected by a significant improvement in the
tests of liver function, in particular, ALT, GGT and alkaline phosphatase, and
non-invasive serum panels of steatosis (SteatoTest ®, FLI) and fibrosis (NAFLD
fibrosis score and Fibrotest ®), which are more sensitive and earlier response
indicators than histology.*

Metabolic markers are clearly cited. These are ALT, Gamma GT and alkaline phosphatase. All three are recognized markers of liver disease.

It is therefore important to look at 'the colour of the water' in placebo patients and those receiving treatment beyond the histology results and try to correlate them.

The histological results showed 20% of treated patients and 10% of placebo patients with histological reversion at the end of the study (that is, having drawn a white ball at the end of the study according to our model).

Some observers read this as 20% of responders to treatment.

**If this were the case, it would show up in metabolic markers.**

**Let us study the evolution of alkaline phosphatase (ALP) to check.**

At the end of 52 weeks of treatment, i.e. when the second biopsy was done, we see that none of the patients in the placebo arm had improved on this marker. Without exception they had all worsened.

**To return to our model, the 10% of placebo patients having drawn a white ball at the end of the trial are false positives: serological markers show no change for the better, but rather a worsening of liver function.**

In our model the water became even redder, and the white ball drawn out of the bag was not consistent with the colour of the liquid.

**These 10% of supposed cures among placebo-treated patients were without doubt false positives.**

Compare now the patients treated with a dose of 120 mg/d. According to the histological findings, **80% of patients drew a red ball at the end of the study and were not declared cured by the histological criteria of the end-point.**

**The
histological result is inconsistent with the data for ALP**

In the graph we see that **100% of patients treated with Elafibranor have seen a very significant improvement in their serological markers.** Here alkaline phosphatase decreased by between 17% and 24%, while in patients on placebo it increased by between 1% and 7%.

80% of
patients on treatment have therefore drawn a red ball at the end of study,
while the colour of the water was progressively getting clearer ...

**This inconsistency suggests a tendency towards false negatives.**

This analysis is also confirmed for ALT and GGT.

**To
give a concrete example.**

**If
the criterion of end of study (endpoint) was a combined end-point of histology
and serology, the GOLDEN results would be spectacular.**

**An
identical reversion criterion as in the original study, combined with a 10%
minimum requirement of decrease for ALT, would have given the
following results.**

Elafibranor 21% vs placebo 0% with an infinitesimal p!
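A combined end-point of this kind can be expressed as a simple rule, sketched below. The patients listed are invented purely for illustration and are not GOLDEN data; only the 10% ALT threshold comes from the text, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    histological_reversion: bool  # white ball on the second draw
    alt_change_pct: float         # ALT change vs baseline; negative means a decrease


def meets_combined_endpoint(p: Patient, min_alt_decrease_pct: float = 10.0) -> bool:
    """Responder only if histology reverts AND ALT falls by at least the minimum."""
    return p.histological_reversion and p.alt_change_pct <= -min_alt_decrease_pct


# Hypothetical patients, invented for this example (NOT data from the study).
patients = [
    Patient(True, -22.0),   # reversion confirmed by ALT: responder
    Patient(True, 4.0),     # reversion but ALT worse: likely false positive
    Patient(False, -18.0),  # ALT better but no reversion: not counted
    Patient(False, 2.0),    # neither criterion met
]

responders = sum(meets_combined_endpoint(p) for p in patients)
print(f"{responders} of {len(patients)} hypothetical patients meet the combined end-point")
```

Requiring both criteria filters out a histological reversion that the blood marker contradicts, which is how the placebo rate can fall to zero under such a rule.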

**And of the 80% of patients without histological reversion, all have significant metabolic improvements that clearly show they are responding to treatment, whereas the histological results do not show this trend. Which are we to believe, unreliable histological results or precise biological data?**

**This is why those concerned by a reversion rate of only 20% of NASH in Golden should take the time to review the study carefully; they will then better understand the positive conclusions of the authors of the article.**

**Conclusions**

**The simple statistical model used in this article demonstrates how the random placebo effect considerably confused the histological outcome of the Golden Phase IIb results. Only by comparing these data with biological markers can we eliminate placebo false positives and treatment false negatives. In the forthcoming Phase III trial, now recruiting, the histological end-point has been retained, based on a single biopsy at the beginning and at the end of the trial, even though the interpretation of the biopsy has been changed.**

**The placebo effect for mildly sick patients will still be just as strong as before. This has been recognised in that only patients with a NAS of 4 or greater will be included in the trial. This should reduce the observed number of placebo false positives in the outcome but will not eliminate them. The best outcome (with few placebo false positives) will be achieved by recruiting a large proportion of severely sick patients, but also by taking account of the initial biomarker analysis in order to eliminate patients who, due to the random nature of the biopsy, have recorded a high histological score, but whose biological examination reveals only a mild stage of NASH.**


*G DIVRY and A.WRIGHT*

**Annexe**

**This slide demonstrates how hepatic biopsy is not always accurate in the detection of fibrosis/cirrhosis.**