The Deadly Corruption of Clinical Trials


A 2006 study in The American Journal of Psychiatry, which looked at 32 head-to-head trials of atypicals, found that 90 percent of them came out positively for whichever company had designed and financed the trial. This startling result was not a matter of selective publication. The companies had simply designed the studies in a way that virtually ensured their own drugs would come out ahead—for instance, by dosing the competing drugs too low to be effective, or so high that they would produce damaging side effects. Much of this manipulation came from biased statistical analyses and rigged trial designs of such complexity that outside reviewers were unable to spot them. As Dr. Richard Smith, the former editor of the British Medical Journal, has pointed out, “The companies seem to get the results they want not by fiddling the results, which would be far too crude and possibly detectable by peer review, but rather by asking the ‘right’ questions.”

Hangover Remedy and the Scientific Method

Here’s a quick summary of the challenges of applying the scientific method to test a supposed remedy:

The acid test, however, is in clinical trials, with human beings, and these are complicated. Basically, what you have to do is give a group of people a lot to drink, apply the remedy in question, and then, the next morning, score them on a number of measures in comparison with people who consumed the same amount of alcohol without the remedy. But there are many factors that you have to control for: the sex of the subjects; their general health; their family history; their past experience with alcohol; the type of alcohol you give them; the amount of food and water they consume before, during, and after; and the circumstances under which they drink, among other variables. (Wiese and his colleagues, in their prickly-pear experiment, provided music so that the subjects could dance, as at a party.) Ideally, there should also be a large sample—many subjects.



Survivorship Bias: Ultramarathoners suffer injuries but most may be minor, a study finds

A study of 396 ultramarathoners found that while many suffer injuries throughout the course of their race, the vast majority of them are minor. Researchers looked at medical data on runners who competed in Racing the Planet 4 Deserts series, a four-part ultra-race that takes place over seven days in rough terrain on four continents. Runners travel 150 miles per race.

What group of people is this study excluding?  Right: everyone who got injured while training, and everyone who has quit running.  To really make the point, it also excludes anyone who may have died from running.  Because the sample contains only runners healthy enough to reach the starting line, this is a blatantly flawed study.

Ultramarathoners suffer injuries but most may be minor, a study finds

Supply, Demand and Marriage

They appear, for example, to focus more critically on the earnings potential of prospective mates. Because house size is often assumed to be a reliable signal of wealth, a family can enhance its son’s marriage prospects by spending a larger fraction of its income on housing.

For example, when Shang-Jin Wei, an economist at Columbia University, and Xiaobo Zhang of the International Food Policy Research Institute examined the size distribution of Chinese homes, they found that families with sons built houses that were significantly larger than those built by families with daughters, even after controlling for family income and other factors. They also generally found that the higher a city’s male-to-female ratio, the bigger the average house size of families that have sons.

Mr. Wei reports that many families with sons have begun to add a phantom third story to their homes, one that looks normal from the outside but whose interior space remains completely unfinished.

“Marriage brokers are familiar with the tactic,” he reports, “yet many refuse to schedule meetings with a family’s son unless the family house has three stories.”

Does impatience make us fat?

Here is a good example of having to control for variables, in an attempt to best isolate the 2 factors being compared.

They controlled for other factors that might come into play, such as demographics and financial characteristics.

With everything else held constant, the researchers found that impatient individuals are more likely to be obese than people who are good at waiting. “We controlled for basically every variable in the kitchen sink,” says Courtemanche, the lead author and a professor at the University of Louisville. “It seems if you genuinely hold all else constant, the more patient you are, the less you weigh.”


The Wallaby That Roared Across the Wine Industry

By the end of 2001, 225,000 cases of Yellow Tail had been sold to retailers. In 2002, 1.2 million cases were sold. The figure climbed to 4.2 million in 2003 — including a million in October alone — and to 6.5 million in 2004. And, last year, sales surpassed 7.5 million — all for a wine that no one had heard of just five years earlier.

Prima facie, it looks like exponential growth.  But, in the real world, nothing ever grows exponentially in perpetuity (except college tuition, it seems).  I looked up sales figures for other years online.  Let’s plot these numbers in a spreadsheet, and see how they look.  As you can see, the growth started to flatten out after a few years.  I actually couldn’t find the sales data for 2008, so this calls for a statistical regression (fancy words for “line of best fit”).  A linear regression only yielded r = .88, while a 2nd degree polynomial (quadratic) regression gave r = .96.  This regression equation is f(x) = -.13x^2 + 513x - 515887

Do you notice the negative leading coefficient of the x^2 term? Remember how this makes the parabola “frown”?  Well, this “inverted parabola” shape clearly reflects the flattening of the sales growth.

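By just looking at the trendline, what’s your estimate for the number of cases sold in 2008? Or, plug 2008 into the equation to get the exact coordinates on the red trendline:  f(2008) = -.13(2008)^2 + 513(2008) - 515887.  (The coefficients printed here are rounded, and with x values as large as 2008 that rounding swamps the answer, so use the spreadsheet’s full-precision coefficients when you evaluate it.)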
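If you’d rather not trust a spreadsheet, here is a rough sketch of the same kind of quadratic regression in plain Python.  It is only an illustration: it uses just the five figures quoted in the article, it assumes “last year” means 2005, and the names fit_quadratic, BASE_YEAR and predict are made up for this example.  Since my spreadsheet had more years of data, the coefficients will not match the equation above.  Measuring years from 2003 keeps the coefficients small, which avoids the rounding sensitivity you get when plugging raw years like 2008 into a quadratic.

```python
def fit_quadratic(xs, ys):
    """Least-squares fit of y = a*x^2 + b*x + c using the normal equations."""
    n = len(xs)
    s1 = sum(xs)
    s2 = sum(x ** 2 for x in xs)
    s3 = sum(x ** 3 for x in xs)
    s4 = sum(x ** 4 for x in xs)
    t0 = sum(ys)
    t1 = sum(x * y for x, y in zip(xs, ys))
    t2 = sum(x * x * y for x, y in zip(xs, ys))
    # Augmented 3x3 system for the unknowns (a, b, c).
    m = [
        [s4, s3, s2, t2],
        [s3, s2, s1, t1],
        [s2, s1, n, t0],
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    a, b, c = (m[i][3] / m[i][i] for i in range(3))
    return a, b, c


# Cases sold (in millions) as quoted in the article.
# Assumption: "last year" in the quote refers to 2005.
years = [2001, 2002, 2003, 2004, 2005]
cases = [0.225, 1.2, 4.2, 6.5, 7.5]

BASE_YEAR = 2003  # hypothetical centering choice, so x runs from -2 to 2
shifted = [y - BASE_YEAR for y in years]
a, b, c = fit_quadratic(shifted, cases)


def predict(year):
    """Evaluate the fitted parabola at a calendar year."""
    t = year - BASE_YEAR
    return a * t ** 2 + b * t + c


print(f"a = {a:.4f}, b = {b:.4f}, c = {c:.4f}")
print(f"projected 2008 sales: {predict(2008):.1f} million cases")
```

Even with only five points, the leading coefficient comes out negative, which is the same “frown” described above.  Treat the 2008 projection as a toy extrapolation, not a forecast.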


Does taking LSD prevent crime?

Dr. Leary began conducting experiments with psilocybin in 1960 on himself and a number of Harvard graduate students after trying hallucinogenic mushrooms used in Native American religious rituals while visiting Mexico. His group began conducting experiments on state prisoners, where they claimed a 90% success rate preventing repeat offenses. Later reexamination of Leary’s data reveals his results to be skewed, whether intentionally or not; the percent of men in the study who ended up back in prison later in life was approximately 2% lower than the usual rate.

Well, the question is this:  Was the drop from 92% down to 90% explained by random chance, or did the LSD really have a statistically significant impact on reducing the crime rate?  Since the text does not provide a sample size, I will just use n=100 to do the math.

The calculations:

H0:  LSD takers had no difference in their repeat offense rates.
HA:  LSD takers did have a difference in their repeat offense rates.


First, take stock of the given information:

n = 100 \\ p_0 = .92 \\ \hat{p} = .90


Next, you calculate the standard deviation of samples of this size.

SD(\hat{p})= \sqrt{\frac{(.92)(.08)}{100}}=.03


To determine how unlikely your sampling result was, you calculate how many standard deviations away from the expected proportion it was (Z-score).

Z(\hat{p})= \frac{\hat{p}-p_0}{SD(\hat{p})}=\frac{.90-.92}{.03}=-.67


Then, you calculate the odds of getting this Z-score via the normal cumulative distribution function.  (What are the odds of this happening randomly?)  If it’s under 5%, then you reject the null hypothesis, because it’s unlikely this variation can be attributed to random chance.  ie: Odds are, the LSD did have an effect.

p(Z \le -.67) = .25 = 25\%


Conclusion:  If the odds of being a repeat offender are 92%, then seeing 90% (or fewer) repeat offenders in a random sample of 100 men is quite likely.  The math shows that the odds of this reduction simply happening by chance (random variation) are 25%.  This is large enough (over 5%) that we cannot conclude the LSD had any true effect on reducing the crime rate.  ie:  The 2% reduction could easily be due to chance.  So, we fail to reject the null hypothesis (H0):  In a sample of 100 test subjects, a drop in the repeat offender rate to only 90% is not enough evidence that the LSD had an effect.
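If you want to check the arithmetic without a calculator, the steps above can be sketched in a few lines of Python.  This is just an illustration: the function name one_proportion_z_test is made up, and n = 100 is the same invented sample size as before.  Because the code carries the standard deviation at full precision instead of rounding it to .03, it lands on a Z-score near -.74 and a probability of roughly 23%, a bit different from the hand-rounded figures but with the same conclusion.

```python
from math import sqrt
from statistics import NormalDist


def one_proportion_z_test(p0, p_hat, n):
    """Return (standard deviation of p-hat, Z-score, lower-tail probability)."""
    sd = sqrt(p0 * (1 - p0) / n)      # SD of sample proportions under H0
    z = (p_hat - p0) / sd             # how many SDs the sample sits from p0
    lower_tail = NormalDist().cdf(z)  # P(Z <= z) on the standard normal
    return sd, z, lower_tail


# Assumption: n = 100 is invented, since the source gives no sample size.
sd, z, p_value = one_proportion_z_test(p0=0.92, p_hat=0.90, n=100)
print(f"SD = {sd:.4f}, Z = {z:.2f}, P(Z <= z) = {p_value:.3f}")
```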

 So, do you have the same lingering question that I did?  How large would the sample size have to be in order for the 2% drop to not be an accident? (Recall, I just made up n=100).  Well, some simple algebra should answer this for us:

First, let’s determine the Z-score at the 5th percentile:

invNorm(.05) = -1.64


Let’s use that in the Z-score calculation to figure out what standard deviation we’d need

-1.64 = \frac{.90-.92}{SD}     (…SD = .012)


Backing this into the SD formula will help us solve for the sample size (n)
.012= \sqrt{\frac{(.92)(.08)}{n}}    (…n = 495)

So, if Timothy Leary showed a repeat offender drop of 2% with a sample size of 495, then we could say the LSD did have an effect.  Why?  Because that much of a drop only has a 5% chance of happening randomly.
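Here is the same “work backwards to n” algebra as a small Python sketch (the function name required_sample_size is hypothetical).  It uses the unrounded 5th-percentile Z-score of about -1.645 rather than -1.64, so expect an answer a few subjects higher than 495.

```python
from math import ceil
from statistics import NormalDist


def required_sample_size(p0, p_hat, alpha=0.05):
    """Smallest n at which a drop from p0 to p_hat reaches the alpha cutoff."""
    z_crit = NormalDist().inv_cdf(alpha)  # Z-score at the alpha percentile
    sd_needed = (p_hat - p0) / z_crit     # SD that puts p_hat exactly at z_crit
    return ceil(p0 * (1 - p0) / sd_needed ** 2)


n_needed = required_sample_size(p0=0.92, p_hat=0.90)
print(f"sample size needed: {n_needed}")
```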

Average hours of sleep (normal distribution)

How Little Sleep Can You Get Away With?


Nice example of a real-life phenomenon closely modelling a Gaussian (normal) distribution.  The average hours of sleep on a weeknight (for males) was 6.9 hours with a standard deviation of 1.5 hours.  Using this data, let’s calculate what percentage of men get a good night’s sleep (8 hours or more).  The diagram indicates 27%.

Z = \frac{8 - 6.9}{1.5} = .73


normalCDF(.73,99) = .23 = 23\%
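The same calculation can be done in Python with the standard library’s NormalDist, as a quick sketch.  Treating 8 hours as the cutoff for a good night’s sleep comes from the Z-score formula above; the variable names are made up.

```python
from statistics import NormalDist

# Weeknight sleep for males, modelled as normal: mean 6.9 h, SD 1.5 h.
sleep = NormalDist(mu=6.9, sigma=1.5)

# Share of men sleeping at least 8 hours = area to the right of 8.
share_8_plus = 1 - sleep.cdf(8)
print(f"P(sleep >= 8 hours) = {share_8_plus:.2f}")
```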



The Statistics of Gaydar

The Science of Gaydar

Lippa had gathered survey data from more than 50 short-haired men and photographed their pates (women were excluded because their hairstyles, even at the pride festival, were too long for simple determination; crewcuts are the ideal Rorschach, he explains). About 23 percent had counterclockwise hair whorls. In the general population, that figure is 8 percent.

Well, just how meaningful is this 23% discrepancy from the norm of 8%?  Maybe it’s just randomness, right?  Well, try the omitted calculations for yourself.  This is an example of a “hypothesis test” in Statistics.  The Null Hypothesis (H0) says that there is no difference in the groups.  The Alternative Hypothesis (HA) says there is a statistically significant difference in the groups.  In a hypothesis test, the essential question is this:  What are the odds that a sample varies this much from the expected percentage (proportion) simply due to natural random variation?  (For example, if you flip a coin 10 times, the single most likely result is 5 heads.  Sometimes, however, you might get 6.  In fact, that should happen about 21% of the time.  Nothing to be alarmed about.  However, the odds of getting exactly 8 heads are only about 4%.  If you do get 8 heads, that’s rare enough to suggest the coin might be rigged.  Odds are you won’t do it again!)
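The coin-flip odds are easy to check for yourself.  Here is a minimal sketch using the binomial formula; the helper name binomial_pmf is made up for this example.

```python
from math import comb


def binomial_pmf(n, k, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


# Chance of each head count in 10 flips of a fair coin.
for heads in (5, 6, 8):
    print(f"P({heads} heads in 10 flips) = {binomial_pmf(10, heads, 0.5):.3f}")
```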

So, for this hair test, we need to ask, “What are the odds of taking a sample of 50 guys and seeing that 23% have a counterclockwise whorl?”  We should expect to get 8%, as per the broad population.  If it’s very very rare to get 23%, then we might suspect there is a connection, and gay men do have different hair whorls than the broad population.  In Statistics, we define “very very rare” as under 5%.  In other words, if the odds that 23% of a sample of 50 have a counterclockwise whorl are under 5%, then the result is statistically significant.


The calculations:

H0:  Gay men have no difference in their hair whorl orientation.
HA: Gay men do have a difference in their hair whorl orientation.


First, take stock of the given information:

n = 50 \\ p_0 = .08 \\ \hat{p} = .23


Next, you calculate the standard deviation of samples of this size.

SD(\hat{p})= \sqrt{\frac{(.08)(.92)}{50}}=.04


To determine how unlikely your sampling result was, you calculate how many standard deviations away from the expected proportion it was (Z-score).

Z(\hat{p})= \frac{\hat{p}-p_0}{SD(\hat{p})}=\frac{.23-.08}{.04}=3.75


Then, you calculate the odds of getting this Z-score via the normal cumulative distribution function.  (What are the odds of this happening randomly?)  If it’s under 5%, then you reject the null hypothesis, because it’s unlikely this variation can be attributed to random chance.  ie: Odds are, the hair is indeed different.

p(Z \ge 3.75) = .000088 \approx 0\%


Conclusion:  If the odds of having a counterclockwise hair whorl are 8%, then it is very unlikely that 23% of 50 random men would exhibit this trait.  The odds of this happening by chance (random variation) are basically 0%.  So, we reject the null hypothesis (H0), and accept the alternative hypothesis (HA).
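One caveat: with an expected count of only 50 × .08 = 4 men, the normal curve is a rough approximation.  As a cross-check, here is a sketch that adds up the exact binomial probabilities instead.  It swaps in a different technique from the Z-score above, and it assumes 12 men out of 50 (23% of 50 is 11.5, rounded up); the helper name binomial_upper_tail is made up.  The exact odds come out larger than the normal approximation suggests, but still well under the 5% cutoff.

```python
from math import comb


def binomial_upper_tail(n, k, p):
    """P(X >= k) when X counts successes in n independent trials."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))


n_men = 50
k_whorls = 12  # assumption: 23% of 50 is 11.5, rounded up to 12 men
tail = binomial_upper_tail(n_men, k_whorls, 0.08)
print(f"P(at least {k_whorls} of {n_men}) = {tail:.5f}")
```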



How does Netflix predict which movies you’ll like best?

The simplest way to predict your rating for a movie is simply to average everyone else’s rating of the movie.  (ie:  They can just give you the 10 movies with the highest average rating)  Of course, it can get much more complex than that, especially when Netflix was giving away a million dollars to anyone who could improve their rating algorithm!  The real crux of this problem is to determine other people who are most like you, and then use their collective ratings on movies you haven’t seen yet.

Neighborhood-based model (k-NN): The general idea is “other people who rated X similarly to you… also liked Y”.  To predict if John will like “Toxic Avenger”, first you take each of John’s existing movie ratings, and for each one (eg: “Rocky”), find the people who rated both “Rocky” & “Toxic Avenger”.  You then compare the ratings given to both movies by these people, and calculate how correlated these 2 movies are.  If there is a strong correlation between their ratings, then “Rocky” is a strong neighbor in predicting John’s rating for “Toxic Avenger”, and you’ll weight the average rating given (by “Rocky” raters) to “Toxic Avenger” highly.  You do this for all the movies that John has already rated, find the strongest neighbor(s), and calculate a predicted “Toxic Avenger” rating from each of them.  You then calculate a weighted average of all these predictions to come up with your ultimate prediction for John’s rating of “Toxic Avenger”.  Lastly, if you do this for every movie in the entire database, you can determine a “Top 10 suggestions” list for John.
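To make the neighborhood idea concrete, here is a toy sketch in Python.  It is a simplified item-based variant, not the method any contest team actually used: it weights John’s own ratings of his movies by how strongly each movie correlates with “Toxic Avenger”, and it ignores neighbors with zero or negative correlation.  The ratings table is invented, and the names RATINGS, pearson, movie_similarity and predict_rating are made up for this example.

```python
from math import sqrt

# Hypothetical toy ratings (user -> {movie: stars}); not real Netflix data.
RATINGS = {
    "John": {"Rocky": 5, "Amelie": 2},
    "Ann": {"Rocky": 5, "Amelie": 1, "Toxic Avenger": 5},
    "Bob": {"Rocky": 4, "Amelie": 2, "Toxic Avenger": 4},
    "Cara": {"Rocky": 2, "Amelie": 5, "Toxic Avenger": 1},
    "Dev": {"Rocky": 1, "Amelie": 4, "Toxic Avenger": 2},
}


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (0.0 if undefined)."""
    n = len(xs)
    if n < 2:
        return 0.0
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def movie_similarity(ratings, movie_a, movie_b):
    """Correlate two movies over the people who rated both of them."""
    pairs = [(r[movie_a], r[movie_b]) for r in ratings.values()
             if movie_a in r and movie_b in r]
    if not pairs:
        return 0.0
    xs, ys = zip(*pairs)
    return pearson(xs, ys)


def predict_rating(ratings, user, target, k=2):
    """Similarity-weighted average of the user's ratings of the k nearest movies."""
    neighbors = []
    for movie, stars in ratings[user].items():
        if movie == target:
            continue
        sim = movie_similarity(ratings, movie, target)
        if sim > 0:  # simplification: drop uncorrelated and negative neighbors
            neighbors.append((sim, stars))
    top = sorted(neighbors, reverse=True)[:k]
    if not top:
        return None  # no usable neighbors for this user and movie
    return sum(sim * stars for sim, stars in top) / sum(sim for sim, _ in top)


print(predict_rating(RATINGS, "John", "Toxic Avenger"))
```

In this made-up table, “Rocky” tracks “Toxic Avenger” closely while “Amelie” runs the opposite way, so John’s high “Rocky” rating drives the prediction.  A real system would also shrink correlations that rest on only a handful of shared raters.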


Here is some general reading on the contest:

The BellKor solution to the Netflix Prize

This Psychologist Might Outsmart the Math Brains Competing for the Netflix Prize

The Netflix Prize: 300 Days Later

The Greater Collaborative Filtering Groupthink: KNN