Algorithms Computer Programming Statistics

How does Netflix predict which movies you’ll like best?

The simplest way to predict your rating for a movie is simply to average everyone else’s rating of the movie.  (ie:  They can just give you the 10 movies with the highest average rating)  Of course, it can get much more complex that than, especially when NFLX was giving away a million dollars to anyone who could improve their rating algorithm!  The real meta of this problem is to determine other people who are most like you, and then use their collective ratings on movies you haven’t seen yet.

Neighborhood-based model (k-NN): The general idea is “other people who rated X similarly to you… also liked Y”.  To predict if John will like “Toxic Avenger”, first you take each of John’s existing movie ratings, and for each one (eg: “Rocky”), find the people who rated both “Rocky” & “Toxic Avenger”.  You then compare the ratings given to both movies by these people, and calculate how correlated these 2 movies are.  If it’s a strong correlation between their ratings, then “Rocky” is a strong neighbor in predicting John’s rating for “Toxic Avenger”.  You’ll weigh in the average rating given (by “Rocky raters”) to “Toxic Avenger” highly.  You do this for all the movies that John has already rated, and find each one’s strongest neighbor(s), and calculated a predicted “Rocky” rating from each movie John has already rated.  You then calculate a weighted average of all these predictions to come up with your ultimate prediction for John’s rating of “Rocky”.  Lastly, if you do this for every movie in the entire database, you can determine a “Top 10 suggestions” list for John.


Here is some general reading on the contest:

The BellKor solution to the Netflix Prize

This Psychologist Might Outsmart the Math Brains Competing for the Netflix Prize

The Netflix Prize: 300 Days Later

The Greater Collaborative Filtering Groupthink: KNN



Set Theory

Links: Humor with Venn Diagrams

Le Grand Content

On the Origin of Venn Diagrams


How They Check if a Credit Card Is Valid

Before the era of ethernet, TCP/IP, and packet verification, electronic data was transmitted over telephone wires.  (Think back to the days of AOL & modems, and that screeching noise when you connected.)  A big problem with this method was the risk of external interference garbling your signal.  What if a bird flew into the phone wire?  Or if someone picked up the other line?   Or it started raining?  Any external interference could result in the  information being sent at that moment to be garbled.  So, if you were transmitting something like a credit card number, how would the receiver know that it wasn’t garbled?  (For example, what if a 3 was garbled into a 4?)   This was a problem solved at IBM back in the 1950s.  Of course, the same issues arise when human error is introduced.  (What if you are reading the credit card aloud over the phone, and the other person types in one of the digits incorrectly?)

For the rest of this post, I am merely introducing a lecture I attended called “Identification Numbers and Check Digit Schemes”, by Joe Kirtland.  Thanks to Joe for sharing his full PPT slides with me (link below).  It talks about the checkdigit systems used to validate credit cards, bar-codes, ISBN numbers, currency serial numbers, etc.  It’s a great real-world application of Algebra, Geometry, and algorithms.

Here is a very crude example that illustrates the concept:  Let’s say you want to transmit the number “34515”.  One validation algorithm requires that the sum of the individual digits be divisible by 10 (mod 10 = 0).  Currently, the sum of the digits is 18 (3+4+5+1+5=18)  So, to make the sum divisible by 10, you just tack on a 2 at the end , and transmit “345152”.  The receiver of the data is told to ignore that last number, which is called a check digit.  The receiver then adds the numbers up and sees if the sum is divisible by 10.  If even one number was changed, it won’t work*.  If the final number he gets doesn’t check out, something went wrong, and you need to resend.

*Question:  This method is not foolproof, can you think of some reasons why?

If this topic interests you, check out the following slides, where Joe explored check digits as they relate to UPC codes, ISBN numbers, credit cards, and serial numbers on various currency.

Click here to view Joe’s full PPT lecture


Algebra Puzzles

Algebra Puzzle: Married Men

In a room full of men, ¾ of them are married.
There are 6 more married men then unmarried.
How many married men are there?

Set up the equations…

Algebra Engineering

What’s more likely to break down: a car with 200 miles or 20,000 miles?

In manufacturing/engineering, there is a concept known as the bathtub curve.  In theory, something that is brand new (including a human being!) is more likely to have failures than something that is a little older and has worked out those early kinks.  Of course, once the product gets old, you’ll start having new reasons for failure (things wearing out).

I don’t have much to add on this topic, but I created this post this for two simple reasons:

  • First, it’s a nice example of a Cartesian graph that makes intuitive sense.  The aggregate blue curve indicates that new things can be lemons, then they work smoothly, and then they wear out and start breaking again.
  • Also, I think its a great example of an authentic real-life piecewise function.   As you can see, it has three very different sections.


Question: In terms of cars, do you think the blue curve above would be so symmetric?



Projectiles: What’s the optimal angle at which to throw something? (to maximize distance)

Wow, a real life formula that uses the double angle trig. identities!  As you can see, the distance a projectile will travel is a function of:  velocity, gravity, and the launch angle.

First, a quick fraction review:  First, recall that \(\frac{1}{1000}\) is a lot smaller than \(\frac{1}{10}\).  Conversely, we can also agree that \(\frac{1}{100}\) is a lot smaller than \(\frac{99}{100}\).  ie: The bigger the denominator (and/or smaller the numerator), the lower the value of the (positive) fraction.

So, since v is in the numerator, the distance traveled (d) increases directly with velocity (in a big way, since it’s squared)  Next, since g is in the denominator, the distance traveled decreases as gravity increases.  (Makes sense, right?)

This is a graph of all Sin(x) values from 0 to 360.  The x-axis is divided into quadrants (0, 90, 180, 270, 360). Notice in the graph that Sin(x) rises from 0 to 1 as x rises from 0 to 90 degrees.  Then, it drops from 1 back to 0 as x rises from 90 to 180 degrees.

Refer back to the double angle \(Sin(2\theta)\) in the original formula up top.  So, as x rises from 0 to 45 degrees, 2x actually rises from 0 to 90, and the \(Sin(2\theta)\) value is increasing.  But, as x continues to rise from 45 to 90 degrees, 2x rises from 90 to 180, which means the \(Sin(2\theta)\) value is now decreasing.

So, what’s the ideal angle to throw something?  The one that maximizes the value of \(Sin(2\theta)\), since it’s a multiplier in the projectile formula.  Well, as you can see in the graph, Sin(90) = 1, the highest possible value for Sin(x).  So, the ideal launch degree is x = 45 (which puts 2x at 90).

So, now you know why the ideal angle in these video games is 45 degrees, and have an inkling of how programmers create classic games like these:


Quitting While You’re Ahead? Random Walks & Markov Chains

Question:  What if you walked up to a “fair” casino game (50/50 odds of winning) that pays even odds with $10,000, and said you would quit as soon as you’re up $1,000 ? (ie: You either walk out with $11,000 or keep playing until you lose everything)  What are the odds of you leaving the casino with a $1000 profit?

Markov Chain can virtually simulate many random walks of this experiment.  With a large enough sample size, you can get an accurate sense of the odds of walking out with $1,000.

The following code runs this simulation as many times as you want, and tells you how many times you got to $11,000.


use Getopt::Long;
use strict;

GetOptions("f:s", "debug", "v");

my $current_round = 0;
my $starting_amt = 10;
my $lost = 0;
my $won = 0;
my $current_amt;
my $x;
my $low_bound = 0;
my $high_bound = 11;

while ($current_round < 100) {
while ($current_amt > $low_bound && $current_amt < $high_bound) {

		#generate a 1 or 0
		$x = int(rand(1)+.5);

		#adjust to either +1 or -1
		if ($x == 0) {
			$x = -1;

		# update total
		$current_amt -= $x;

		# print current total
		print "$current_amt ";


	print "\n";

	#increment rounds won or lost...
	if ($current_amt == 0) {
	} else {

	#reset current amount for next round
	$current_amt = $starting_amt;


print "\nWon = $won\n";
print "Lost = $lost\n";

Answer:  Doing 100 rounds is a large enough sample size to get an accurate result.  On average,  you’ll walk out with $11,000 about 90% of the time you try this experiment.  Why is this still a bad idea?  In practice, gamblers rarely quit while they are ahead, and you still have that 10% odds of a total washout.  Recall, the expected value of this game is still break even.

Here is a sample output of 10 rounds:

Won = 9
Lost = 1


The Flawed Logic of the Low-Fat Movement

Remember truth tables?  If p, then q.  Inverse, Converse, Contrapositive, etc.  First, a quick review.  Take the true statement “If you paint, then you’re an artist”

  • Inverse: “If you don’t paint, then you’re not an artist.”
    (False.  What if you just sculpt?)
  • Converse:  “If you’re an artist, then you paint”
    (False.  What if you just sculpt?)
  • Contrapositive: “If you’re not an artist, then you don’t paint.”
    (Ok, true, because if you did paint, you’d be an artist)

The point is that only the contrapositive is logically equivalent to the original statement.

Now, take a look at the statement “If you eat fat, then you’ll increase the odds of cardiovascular disease.”  What’s the inverse?  “If you don’t eat fat, then you will not increase odds of cardiovascular disease.”  Well, as you know, the inverse is not logically equivalent to the original statement, and therefore can’t be assumed as true.  In this example, limiting fat can lead to eating more carbohydrates (while keeping protein intake constant), which may be linked to cardiovascular disease


Watch 29:15 to 31:23 for this example of the incorrectly assuming the inverse is also necessarily true.

By the way, this lecture was the subject of a recent NYMag article titled Is Sugar Toxic?


SAT score vs. Ability to Save For Retirement

Here’s a link to a article that is a follow-up to the famous Marshmallow study of the 1960s.  Turns out, you can correlate SAT score and the ability to save for retirement as early as age two!  Both require: delayed gratification, self-control, discipline, obedience, etc.

According to Mischel, this view of will power also helps explain why the marshmallow task is such a powerfully predictive test. “If you can deal with hot emotions, then you can study for the S.A.T. instead of watching television,” Mischel says. “And you can save more money for retirement. It’s not just about marshmallows.”

Ted Talk