by Gery Divry and Albert Wright


A short time ago, after reading  the excellent article by Albert Wright and following up on the recent publication of the full results of the Golden study, I was able to point out that the histological scores reported may also include false negatives and false positives.

Using a simplified model that closely mimics the Golden study, I will demonstrate in this article that “placebo false positives” and “treatment false negatives” are the direct random result of the combination of three factors: (i) threshold (cut-off) effects of the selection criteria, (ii) biopsy sampling variation, where the single biopsy is not representative of the whole liver, and (iii) the binary histological end-point, which fails to take into account global progression or regression of the disease. These false results, which significantly distort and confuse the outcome, are entirely the result of the trial design and are independent of the efficacy of the drug used, Elafibranor.

I will try to explain, with the strong help of Albert Wright, why the histological 'endpoints' of clinical studies in NASH should not be considered in isolation, without looking for coherence with all the other data now available from the clinical studies. Treating the histological data in isolation leads to the wrong conclusions, both qualitatively and quantitatively.



The gold standard for the diagnosis of NASH is histological analysis of a liver biopsy. The main clinical studies, the completed Phase 2b and the ongoing Phase 3, therefore comprise at least two liver biopsies, one before and one after treatment.

 A standard liver biopsy is made by passing a needle through the abdomen or via a vein into the liver to extract a sample. 



The needle extracts a sample of tissue with a length between 10 and 30 mm, and a diameter  between 0.9 and 2.1 mm.

For diagnosing fibrosis, the desired length is more than 25 mm.


We can estimate that an ideal biopsy for histological analysis is at least 1.6 mm in diameter and 25 mm in length, with an average weight of 30 mg. The average weight of a human liver is about 1.5 kg, so a liver biopsy sample corresponds to a single random volume of about 1/50,000th of its mass. This quantity is sufficient to make a histological analysis, but the question remains whether such a small and unique sample is representative of the liver as a whole.
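The 1/50,000 figure follows directly from the two masses quoted above. As a quick check, here is a minimal sketch using only those two numbers (the variable names are illustrative):

```python
# Values taken from the text: a 30 mg biopsy sample and a 1.5 kg liver.
BIOPSY_MASS_MG = 30
LIVER_MASS_MG = 1.5 * 1_000_000  # 1.5 kg expressed in milligrams

fraction = BIOPSY_MASS_MG / LIVER_MASS_MG
samples_per_liver = LIVER_MASS_MG / BIOPSY_MASS_MG

print(f"Sampled fraction of the liver: {fraction:.6f}")
print(f"Equivalent to 1 part in {samples_per_liver:,.0f}")
```

This is where the figure of 50,000 potential samples used in the rest of the article comes from.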

In the vast majority of analytical techniques, the sampling process is taken very seriously especially if the mass to be evaluated is heterogeneous. To assess the mineral content of a mineral resource, several long cores are drilled out and systematically sampled at different levels to obtain an average value of the mineral content throughout the volume. To check for prostate cancer, some 20 biopsies are taken systematically spread throughout the organ so as to avoid missing cancerous cells which would lead to a “false negative” result with obvious consequences. Under ideal circumstances, to eliminate sampling variation, a single sample should only be taken if the sample volume is homogeneous (as with a blood sample) or can be made so by an adequate mixing process such as by stirring a liquid.

The question we have to ask is therefore the following: is a diseased liver sufficiently homogeneous that taking a single sample of 1/50,000 of its mass can be trusted to eliminate sampling variation and hence provide reliable quantitative results for use in a clinical trial?

This question was answered in an article published in 2006 by Vlad RATZIU et al, in which two biopsies were taken from 51 patients with NASH or NAFLD.

The results and conclusions were very clear :

Results: No features displayed high agreement; substantial agreement was only seen for steatosis grade; moderate agreement for hepatocyte ballooning and perisinusoidal fibrosis; fair agreement for Mallory bodies; acidophilic bodies and lobular inflammation displayed only slight agreement.

Overall, the discordance rate for the presence of hepatocyte ballooning was 18%, and ballooning would have been missed in 24% of patients had only 1 biopsy been performed. The negative predictive value of a single biopsy for the diagnosis of NASH was at best 0.74. Discordance of 1 stage or more was 41%.

Conclusions: Histologic lesions of NASH are unevenly distributed throughout the liver parenchyma; therefore, sampling error of liver biopsy can result in substantial misdiagnosis and staging inaccuracies



In this simple model we consider the liver as a black bag containing 50,000 coloured balls ranging from white to dark red depending on the stage of the disease. These are the potential biopsy samples.

A healthy liver would contain only white balls, and a liver at the terminal cirrhosis stage would contain only dark red balls. Vermilion corresponds to established NASH.

Imagine that we choose to carry out a liver biopsy on a patient with NASH. This corresponds to blindly pulling out one of 50,000 balls from the bag and looking at its colour. We will have a ball with a colour ranging from white to dark red.

But the consensus definition of NASH arbitrarily works with a threshold, in our model, vermilion or darker.

All balls of a colour paler than vermilion are not regarded as NASH, whereas vermilion and all the darker ones are proof of NASH. The end-point of the clinical study is binary: paler than vermilion = no NASH, vermilion or darker = NASH. It is also unique for each patient: you can only pick one ball per patient.

Because of the binary nature of the end-point, we can simplify the model without changing the outcome. All the balls paler than vermilion can be painted white and all those vermilion or darker can be painted red. The outcome is exactly the same.

Then, we would have a bag of 50,000 balls of only two colours, white and red.

White = no NASH,

red = NASH.


NOW, if we draw a ball at random, can we honestly judge that the result obtained is truly representative of the stage of the NASH ?

Well, it could depend on the severity of the disease :

If we draw a single white ball without having any other information, we cannot deduce anything specific about the severity of the patient's NASH. We cannot state that the patient does not have NASH, because we have only sampled 1/50,000 of the liver and we know nothing about its homogeneity. The statistics are simply loaded against a reliable quantitative result.

If a red ball is drawn in isolation, it can be inferred that the patient does indeed have 'NASH' in this small sample but we cannot know how much of the liver is affected.

You now can see that drawing a white ball can easily lead to a false negative result, whereas drawing a red ball gives a true positive but gives no information about how much of the liver is affected.

This method can only be reliable if we significantly increase the number of samples and locations in the liver, or if we take bigger samples more representative of the whole liver (equivalent to having a smaller number of coloured balls of a much larger size). Compare this with prostate cancer biopsies, where 20 samples are taken for an organ 30 times smaller so as to reduce the possibility of false negatives.
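To see how quickly additional samples reduce the risk of a false negative, here is a minimal sketch. It assumes the draws are independent and that the bag is large enough for the red fraction to stay constant between draws; the function name `prob_all_white` is purely illustrative.

```python
def prob_all_white(red_fraction: float, n_draws: int) -> float:
    """Chance that every one of n independent draws is white,
    i.e. that the disease is missed entirely."""
    return (1.0 - red_fraction) ** n_draws


# A mildly affected bag (25% red), sampled 1, 2, 5 or 20 times.
for n in (1, 2, 5, 20):
    print(f"{n:>2} draw(s): chance of missing the disease = {prob_all_white(0.25, n):.4f}")
```

With one draw, the chance of missing the disease is simply the white fraction; with each extra draw it is multiplied by that fraction again.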

The reliability of the method increases at both extremes of the spectrum of the disease.

If the bag is full or nearly full of white balls, then the probability of drawing a white ball consistent with the stage of the disease is also higher. Similarly, if the bag is full or nearly full of red balls, then we have a higher probability of drawing a red ball consistent with the disease.

In the middle range, we have a 50% probability of drawing either red or white and equally a 50% chance of error.

To further complicate matters, a clinical study is carried out in two runs, before and after treatment, so the sampling errors of the two runs add up. Except for the extreme cases, the sampling errors make the exercise look like a lottery.


To comply with the selection criteria, patients selected for the trial must have histological NASH. This is equivalent to drawing a red ball.

In our model we treat 50% of patients with placebo and 50% with the drug.

At the end of the study we make another biopsy. We blindly draw a single ball again.

If we draw a white ball, the patient is considered cured, and if we draw a red ball, the patient still has NASH.


I imagine that you are now beginning to understand the problem!


The Placebo Effect

Consider a patient who has a mild stage of the disease, say 25% of red balls in the bag. This patient was recruited into the study because he had drawn a red ball before treatment even though we had only a 25% chance of finding the patient sick. That is the luck of the draw. The biopsy was taken from a zone where the NASH was strong, he drew a red ball. Now there's the second case of bad luck. This patient had a 50/50 chance of falling into the placebo group. In the double blind method, this patient drew the placebo and received no treatment for the disease.

At the end of the study we made a second draw and this time, because the disease had not evolved, there were still 75% of white balls in the bag. There was therefore a 75% chance of drawing a white ball and concluding that the patient had been cured !!!

This is the statistical chance of getting a false positive with a placebo under these conditions. There was a 25% chance of being selected as sick on the first draw. However, once selected, this patient is considered to be just as sick as any other in the study. At the end, this patient has a 75% chance of being considered cured even when on placebo. This is the problem when only mildly sick patients are included in a study in which the end-point has a significant sampling or analytical error. You can now see that this result has absolutely nothing to do with the efficacy of the drug being tested. It's all about how sampling errors impact on the two threshold effects of inclusion and end-point. Nothing else.



Now take a patient who had fairly advanced disease, say 75% red balls in the bag. This patient was also retained in the study because he drew a red ball before the study.

We had a 75% chance of actually including a truly sick patient according to the definition of the trial.

If the patient was on placebo and nothing changed in his state during the study, at the end of the trial we still have a 25% chance of finding the patient cured even though nothing really happened !!!

So in the placebo arm, the probability of false positives increases strongly when mildly affected patients are included. This is purely a statistical effect: the patients were not miraculously cured by themselves. This bias could have been more easily identified if we had included mildly sick patients (25% red balls) who drew a white ball on the first draw. If we had included such patients, then by the same statistics we would have 25% of such patients declared sick after treatment with placebo even though they were declared healthy on the first draw. Here the selection threshold is also the cause of the problem.

The distortion is due to the fact that we work on a population that was statistically unlikely to be included in the study (25%), but which, once included by chance as being sick, had a 75% chance of being declared cured without treatment at the end.
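The placebo arm of the model can be simulated directly. The sketch below is a Monte Carlo illustration under the article's assumptions only: every candidate has the same red fraction, nothing changes on placebo, and both biopsies are independent single draws. The function name, candidate count and seed are arbitrary choices made for the example.

```python
import random


def simulate_placebo_arm(red_fraction: float, n_candidates: int = 100_000, seed: int = 0):
    """Return (enrolment rate, declared-cure rate among enrolled placebo patients)."""
    rng = random.Random(seed)
    enrolled = 0
    declared_cured = 0
    for _ in range(n_candidates):
        # First biopsy: only candidates who draw a red ball are enrolled.
        if rng.random() < red_fraction:
            enrolled += 1
            # Second biopsy: the bag is unchanged on placebo.
            # Drawing a white ball counts as a "cure".
            if rng.random() >= red_fraction:
                declared_cured += 1
    return enrolled / n_candidates, declared_cured / enrolled


for p in (0.25, 0.75):
    enrol_rate, cure_rate = simulate_placebo_arm(p)
    print(f"{p:.0%} red: enrolled {enrol_rate:.1%}, declared cured on placebo {cure_rate:.1%}")
```

The simulated rates should come out close to the 25%/75% and 75%/25% figures derived by hand above, since the second draw does not depend on the first.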


Placebo Effect vs. Drug Effect (mildly sick patients, 25% red)

For patients receiving the drug, we assume the same sampling statistics (and assume that the drug is effective). Let us take the case where the drug actually causes regression of the disease by 50%.

Take a patient who had a mild case of the disease, say 25% red balls in the bag. There was a 25% chance of him being selected as sick.

If the patient is in the treatment arm, then at the end of the trial he has only 12.5% of red balls in his bag. On the second draw, there is an 87.5% chance of him being declared cured.

Declared rate of cure : placebo 75%, drug 87.5%

So, for patients who are only mildly sick, the histological method of measurement using a binary threshold for both selection and end-point creates a strong placebo effect. The effect of even a highly effective drug is barely observable.



Placebo Effect vs. Drug Effect (severely sick patients, 75% red)

Take a patient who had advanced disease, say 75% red balls in the bag. There was a 75% chance of him being selected as sick.

If the patient is in the treatment arm, then at the end of the study the disease is reduced by 50%, so he now has only 37.5% of red balls in his bag. On the second biopsy, he has a 62.5% chance of drawing a white ball and being declared cured.

Declared rate of cure : placebo 25%, drug 62.5%

So, for patients who are severely sick, using the same histological method the placebo effect is very much reduced and the efficacy of the drug becomes clear.

To summarise this important observation:

·     Patients with mild disease (25%) and 50% regression on drug

Probability of positive observation of cure by placebo : 75%

Probability of positive observation of cure by drug : 87.5%

87.5 / 75 = 1.17 observed efficacy ratio and 16.7% numerical difference.

·     Patients with advanced disease (75%) and 50% regression on drug

Probability of positive observation of cure by placebo : 25%

Probability of positive observation of cure by drug : 62.5%

62.5 / 25 = 2.5 observed efficacy ratio and 150% numerical difference.


This histological measurement method is therefore highly dependent on the severity of the disease and unsuitable for mild disease patients. This can be verified quickly if we look at the extremes.

·     Patients with very mild disease (10%) and 50% regression on drug

Probability of positive observation of cure by placebo : 90%

Probability of positive observation of cure by drug : 95%

95 / 90 = 1.056 observed efficacy ratio and 5.6% numerical difference.

·     Patients with very advanced disease (90%) and 50% regression on drug

Probability of positive observation of cure by placebo : 10%

Probability of positive observation of cure by drug : 55%

55 / 10 = 5.5 observed efficacy ratio and 450% numerical difference.
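The four cases above all follow the same arithmetic, which can be written as a short function. This is a sketch of the article's simplified model only: `red_fraction` is the share of red balls at entry (it must be below 100%, otherwise the placebo rate is zero and the ratio is undefined), and `regression` is the assumed 50% reduction on drug. The names are illustrative.

```python
def declared_cure_rates(red_fraction: float, regression: float = 0.5):
    """Return (placebo cure rate, drug cure rate, drug/placebo ratio)
    for a single second draw in the two-colour model."""
    placebo = 1.0 - red_fraction                       # bag unchanged on placebo
    drug = 1.0 - red_fraction * (1.0 - regression)     # red share reduced on drug
    return placebo, drug, drug / placebo


for severity in (0.10, 0.25, 0.50, 0.75, 0.90):
    placebo, drug, ratio = declared_cure_rates(severity)
    print(f"{severity:.0%} red: placebo {placebo:.1%}, drug {drug:.1%}, "
          f"ratio {ratio:.2f}, difference {ratio - 1:.1%}")
```

Running it over a range of severities shows the ratio climbing steadily as the share of red balls at entry increases.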

If we draw a graph showing the discriminatory power of the method according to the severity of the patient's disease at study entry, this point is easy to see.

[Figure: discrimination ratio vs NASH severity]

So how should we interpret the histological results?

To limit the statistical effect of a histological sample of 1/50,000 of the liver mass and its binary interpretation of reversion, several other parameters are available to us in the study results.

A second analysis of the colour of the balls can be made, this time taking into account not 2 colours but the 8 colours of the NAS score.

One ball is drawn before treatment and one after treatment, and we look at the intensity of the colour of each ball. If the second ball is lighter than the first, one imagines that the patient is getting better.

A statistical approach can rapidly demonstrate that we are not really much better off than with the previous method. By picking at random only one ball out of 50,000 we cannot draw any definitive conclusions. The statistics are still stacked against a reliable outcome.

The biopsy sampling rules try to limit this lottery effect  by requiring a minimum length of the biopsy, and by taking several readings on different parts of the sample. The problem remains that a single biopsy only provides a picture of a small part of the liver when we need to find a global average for the whole liver.


This is where serological markers come into play.

Biopsy samples give us a physical and static picture of the liver in a specific small volume.  Blood markers tell us about how the liver is functioning as a whole. They tell us about a specific biological liver function produced by the whole liver. Since they can be taken at regular intervals, they also tell us about the progression or regression of the health of the liver. These two methods are complementary.


Imagine the colour of the balls in the bag is water soluble. They don't just have a colour, they have a function. They interact with the water just as liver cells interact with the blood of the patient. Healthy cells clean up the blood (white), sick cells release toxins into the blood (red). If the bag is dipped in water for some time, the water will be coloured to the average colour of the balls. The more red balls there are, the redder the water will turn. Looking at the intensity of the colour of the water we can get a good idea of the proportion of red, pink and white balls. We get a quantitative value which can be accurately measured. This is our blood marker. The more strongly it is coloured, the sicker the patient is.

If the water is very red, we deduce that there is a majority of very red balls in the bag. If the water is clear, we can deduce that there are few red balls. If it is pink, we will not know if it is the result of a few very red balls or a large number of pink balls.

But what is interesting is that we have an overall idea of the progress of the disease even though we don't know exactly what is happening in the bag. There can be a lot of pink balls but no red balls, and so no NASH, or some red balls and therefore some NASH. The liquid marker has the advantage that it does not have a threshold effect.
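The difference between the liquid marker (an average) and the biopsy threshold can be illustrated with two made-up bags. In this sketch colour intensity runs from 0 (white) to 100 (dark red), and the value 60 standing for "vermilion" is an arbitrary assumption chosen for the example, not a clinical figure.

```python
NASH_THRESHOLD = 60  # hypothetical intensity standing for "vermilion"


def summarise(bag):
    """Return (average colour of the water, share of balls at or above the threshold)."""
    mean_colour = sum(bag) / len(bag)
    red_fraction = sum(1 for colour in bag if colour >= NASH_THRESHOLD) / len(bag)
    return mean_colour, red_fraction


few_dark_balls = [100] * 30 + [0] * 70   # 30 dark red balls, 70 white balls
many_pink_balls = [30] * 100             # every ball pale pink

for name, bag in (("few dark balls", few_dark_balls), ("many pink balls", many_pink_balls)):
    water, red_share = summarise(bag)
    print(f"{name}: water colour {water:.0f}, share of 'NASH' balls {red_share:.0%}")
```

Both bags tint the water identically, yet only the first contains balls that a biopsy would classify as NASH. The average carries no threshold, which is both its strength and its limitation.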

Now if we correlate the colour of the histological result with the colour of the liquid, we can eliminate some errors.

Initially all the patients had drawn a red ball (to be selected for the study).

On the second draw, after treatment, a white ball is considered as a cure (positive) and a red ball, a lack of cure (negative).

Imagine a patient on placebo.

The initial liquid was pink. One can then imagine that we are dealing with a mildly sick patient.

·     If at the end of the study we draw a red ball and the liquid is still pink, or a little redder, we deduce that this patient is not cured. It's a true negative!

·     If at the end of the study we draw a white ball and the liquid is still pink or a little red, this patient, according to histological criteria, is considered to be completely cured, but our liquid marker says that this patient is still just as sick as before. The lack of consistency between the two methods indicates that this is a false positive!


Now imagine a patient on treatment.

Initially a red ball was drawn (to be selected for the study) and the liquid was dark red.

One can then deduce that we are dealing with a very sick patient.

  • If, at the end of the study, we draw a red ball and the liquid is still dark red, we can deduce that this patient has not been cured. It's a true negative!

  •   If, at the end of the study, a white ball is drawn and the liquid is still red or dark red, this patient, according to histological criteria, is considered completely cured but the liquid marker does not agree. It is clear that the functional state of the patient's liver has not improved.  The treatment did not work. This is a false positive!    


  • If, at the end of the study, a white ball is drawn and the liquid is clear white or pink, this patient is, according to histological criteria, considered cured, and the liquid marker confirms this. This patient is cured, a true positive!

  • If, at the end of the study, a red ball is drawn and the liquid is clear white or pink, then according to histological criteria this patient is still considered sick, but the liquid marker shows an overall improvement in liver function. The patient is getting better but the biopsy is not representative of the overall state of the liver. It's a false negative!


It is therefore important to read the histological findings in the light of serological results otherwise we risk misinterpreting the data with false positives and false negatives.
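The four cases described above amount to a small decision table. Here is a minimal sketch of it, reducing each observation to a yes/no value, which is a simplification of the model for illustration; the function and parameter names are hypothetical.

```python
def interpret(biopsy_white: bool, marker_improved: bool) -> str:
    """Cross-check the end-of-study biopsy against the liquid (blood) marker.

    biopsy_white    -- True if the second draw is a white ball (histological "cure")
    marker_improved -- True if the water is clearly paler than at the start
    """
    if biopsy_white and marker_improved:
        return "true positive"    # both methods agree: cured
    if biopsy_white and not marker_improved:
        return "false positive"   # biopsy says cured, marker disagrees
    if not biopsy_white and marker_improved:
        return "false negative"   # biopsy says sick, marker shows improvement
    return "true negative"        # both methods agree: not cured


for white in (True, False):
    for improved in (True, False):
        print(f"white ball={white}, marker improved={improved}: {interpret(white, improved)}")
```

Only the two cases where the methods agree can be taken at face value; the two disagreements flag a biopsy that is probably not representative of the whole liver.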

And that is why it is urgent to develop specific serological markers for the disease: biopsies are too unreliable because of random sampling errors and interpretation thresholds.




Take the example of the GOLDEN study of GENFIT, in the article published in Gastroenterology, a passage from the discussion merits attention:

Histological advantage of the 120 mg dose was reflected by a significant improvement in the tests of liver function, in particular, ALT, GGT and alkaline phosphatase, and non-invasive serum panels of steatosis (SteatoTest ®, FLI) and fibrosis (NAFLD fibrosis score and Fibrotest ®), which are more sensitive and earlier response indicators than histology.

Metabolic markers are clearly cited. These are ALT, Gamma GT and alkaline phosphatase. All three are recognized markers of liver disease.

It is therefore important to look at 'the colour of the water' in placebo patients and those receiving treatment beyond the histology results and try to correlate them.

The histological results showed 20% of treated patients and 10% of placebo patients with histological reversion at the end of the study (having drawn a white ball at the end of the study, according to our model).

Some observers read this as 20% of responders to treatment.

If this were the case, it would show up in the metabolic markers.

Let us study the evolution of alkaline phosphatase (ALP) to check.

At the end of 52 weeks of treatment, i.e. when the second biopsy was done, we see that none of the patients in the placebo arm had improved on this marker. Without exception, they had all worsened.

To return to our model, the 10% of placebo patients who drew a white ball at the end of the trial are false positives: serological markers show no change for the better, but rather a worsening of liver function.

In our model the water became even redder, and the white ball drawn out of the bag was not consistent with the colour of the liquid.

These 10% of supposed cures among placebo-treated patients were clearly false positives.

Now compare the patients treated with a dose of 120 mg/d. According to the histological findings, 80% of these patients drew a red ball at the end of the study and were not declared cured by the histological criteria of the endpoint.

The histological result is inconsistent with the data for ALP.

In the graph we see that 100% of patients treated with Elafibranor have seen a very significant improvement in their serological markers: a decrease in alkaline phosphatase of between 17% and 24%, while patients on placebo saw their levels increase by between 1% and 7%.

80% of patients on treatment have therefore drawn a red ball at the end of the study, while the colour of the water was progressively getting clearer.
This inconsistency suggests a tendency towards false negatives.

This analysis is also confirmed for ALT and GGT.

To give a concrete example:

If the end-of-study criterion (endpoint) were a combined end-point of histology and serology, the GOLDEN results would be spectacular.

The same reversion criterion as in the original study, combined with a minimum required decrease of 10% in ALT, would have given the following results.

  Elafibranor 21% vs placebo 0%, with an infinitesimal p-value!
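The logic of such a combined end-point can be sketched as follows. This only illustrates the rule "histological reversion AND at least a 10% decrease in ALT". The `Patient` record, its field names and the example values are invented for the illustration and are not data from the GOLDEN study.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    histological_reversion: bool  # white ball on the second biopsy
    alt_change_pct: float         # change in ALT from baseline; negative = decrease


def is_combined_responder(patient: Patient, min_alt_decrease_pct: float = 10.0) -> bool:
    """Responder only if histology reverts AND ALT falls by at least the minimum."""
    return (patient.histological_reversion
            and patient.alt_change_pct <= -min_alt_decrease_pct)


def response_rate(patients: list[Patient]) -> float:
    """Fraction of patients meeting the combined end-point."""
    return sum(is_combined_responder(p) for p in patients) / len(patients)


# Made-up records for illustration only; these are NOT GOLDEN trial data.
example_arm = [
    Patient(histological_reversion=True, alt_change_pct=-25.0),   # meets both criteria
    Patient(histological_reversion=True, alt_change_pct=4.0),     # ALT worsened: excluded
    Patient(histological_reversion=False, alt_change_pct=-30.0),  # no reversion: excluded
    Patient(histological_reversion=False, alt_change_pct=2.0),    # meets neither
]
print(f"Combined responders: {response_rate(example_arm):.0%}")
```

A patient who draws a white ball while ALT is rising, the pattern the model identifies as a placebo false positive, is no longer counted as a responder under this rule.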

As for the 80% of patients without histological reversion, all show significant metabolic improvements that clearly indicate they are responding to treatment, whereas the histological results do not show this trend. Which are we to believe: unreliable histological results or precise biological data?

This is why those who are concerned by a NASH reversion rate of only 20% in Golden should take the time to review the study carefully. They will then better understand the positive conclusions of the authors of the article.



The simple statistical model used in this article demonstrates how the random placebo effect considerably confused the histological outcome of the Golden Phase IIb results. Only by comparing these data with biological markers can we eliminate placebo false positives and treatment false negatives. In the forthcoming Phase III trial, now recruiting, the histological end-point has been retained, based on a single biopsy at the beginning and end of the trial, even though the interpretation of the biopsy has been changed.

The placebo effect for mildly sick patients will still be just as strong as before. This has been recognised, in that only patients with a NAS of 4 or greater will be included in the trial. This should reduce the observed number of placebo false positives in the outcome but will not eliminate them. The best outcome (with few placebo false positives) will be achieved by recruiting a large proportion of severely sick patients, but also by taking account of the initial biomarker analysis in order to eliminate patients who, due to the random nature of the biopsy, have recorded a high histological score, but whose biological examination reveals only a mild stage of NASH.




This slide demonstrates how hepatic biopsy is not always accurate in detecting fibrosis/cirrhosis.


WWW.NASHBIOTECHS.COM  -  Copyright G DIVRY 2015-2016