Sam Symons

Notes for Coursera Machine Learning, Week 1

Some time ago, I started Andrew Ng’s Machine Learning course on Coursera. I loved every bit of it, but I only got halfway through before I started a new job and it ended up falling by the wayside.

Since time is more or less on my side again, I’ve since started going back through it from week 1 with a couple friends. We’ve been taking notes as we go, which I’ll be polishing up and posting here!

These notes were written by myself and Rob Hemingray.

Linear Algebra Review

Matrices and vectors are used almost everywhere throughout machine learning. They are how we represent sets of values in a way that is efficient to compute upon — modern GPUs are designed to calculate this stuff extremely efficiently, so running machine learning algorithms on a graphics card is a great idea!

If you’re more familiar with programming than linear algebra, think of a vector as an array, and a matrix as a two-dimensional array.

Here’s an example:

\(\begin{bmatrix} 1 & 2 & 3 \end{bmatrix}\) ← vector

\( \begin{bmatrix} 1 & 2 & 3 \cr 4 & 5 & 6 \end{bmatrix} \) ← matrix

Matrices are measured by their number of rows and columns. The matrix above would be a \(2 \times 3\) matrix. Likewise, a vector is just a \(1 \times n\) matrix (or \(n \times 1\) depending on whether you want a row or column vector).

Addition and Scalar Multiplication

Vectors and matrices can be added and multiplied with scalars pretty easily. They can be added together if they have the same dimensions.

$$ \begin{bmatrix} 1 \cr 2 \cr 3 \end{bmatrix} + \begin{bmatrix} 4 \cr 5 \cr 6 \end{bmatrix} = \begin{bmatrix} 5 \cr 7 \cr 9 \end{bmatrix} $$

This works with multiplication as well. If you’re multiple a matrix by a scalar, you do it element-wise, taking each element and multiplying it individually. You want to be careful to mention whether you are multiplying element-wise with a matrix, because multiplying a matrix by a vector is a different thing entirely.

Matrix-Vector Multiplication

Let’s start with multiplying a matrix with a vector. In order for this to work at all, your matrix needs to have the same number of columns as the vector has rows. You’ll see why in a second.

$$ \begin{bmatrix} 1 & 3 \cr 4 & 0 \cr 2 & 1 \end{bmatrix} \times \begin{bmatrix} 1 \cr 5 \end{bmatrix} = Something $$

So we have our matrix and our vector. The matrix is \(3 \times 2\), and the vector is \(2 \times 1\). Notice: the matrix’s column count matches up with the vector’s row count. Let’s work through this multiplication and you’ll see why the element totals matter.

For each row in this matrix, multiply it element-wise by the vector and sum the result (this is the dot product). For \(\begin{bmatrix} 1 & 3 \end{bmatrix}\), the first row, multiply that by \(\begin{bmatrix} 1 & 5 \end{bmatrix}\) (the transpose of the vector we’re using) and sum the result. This gives us \(16\).

Now, do the same with the second row to get a result of \(4\). Finally, the last row gives us \(7\). As you go, each row you multiply becomes a new row in a new vector; here’s what ours looks like.

$$\begin{bmatrix} 16 \cr 4 \cr 7 \end{bmatrix}$$

So our \(3 \times 2\) matrix multiplied by our \(2 \times 1\) vector gave us a new \(3 \times 1\) vector. This might make sense! Our matrix had three rows, and we are creating a new row for each row we calculate!

Matrix-Matrix Multiplication

So matrix-vector multiplication is easy enough. What about multiplying a matrix by another matrix? As it happens, this is pretty similar to matrix-vector multiplication, except that you’re effectively multiplying your matrix by \(n\) vectors. Perhaps an example would help.

$$ \begin{bmatrix} 1 & 3 & 2 \cr 4 & 0 & 1 \end{bmatrix} \times \begin{bmatrix} 1 & 3 \cr 0 & 1 \cr 5 & 2 \end{bmatrix} $$

Going by our matrix-vector definition of multiplication, if we have a \(2 \times 3\) matrix then we want a \(3 \times 1\) vector to go with it, right? As long as the matrix columns match the vector rows, we’re in business. That’s still the case here! The only thing is, now we’re dealing with multiple \(3 \times 1\) vectors.

For the first row in the first column of the new matrix, we have to work out \((1 \cdot 1 + 3 \cdot 0 + 2 \cdot 5)\). For the second row of the first column: \((4 \cdot 1 + 0 \cdot 0 + 1 \cdot 5)\).

Putting this together, we get \(\begin{bmatrix} 11 \cr 9 \end{bmatrix}\).

Easy! Now you have to do the same with the next vector, \(\begin{bmatrix} 3 \cr 1 \cr 2 \end{bmatrix}\).

The final result gets us a brand new \(2 \times 2\) matrix:

$$\begin{bmatrix} 11 & 10 \cr 9 & 14 \end{bmatrix}$$

Remember how the previous matrix-vector multiplication gave us a \(3 \times 1\) vector? This is the same deal! We started with a \(2 \times 3\) matrix and a \(3 \times 2\) matrix, and we got a \(2 \times 2\) result. The outside integers represent the size of the result, and the inside integers have to be equal.

Linear Regression with One Variable

For linear regression with one variable, we effectively have a vector of some variables, and then a vector of their expected values. Let’s say we’re trying to predict house prices based on the size of the house in square feet (just one variable).

house_size = [1000, 1500, 750, 2000, 3500]
selling_price = [10000, 15000, 7500, 20000, 35000]

This is a very contrived example. The goal here is to find some function that can generalize these values: f(house size) = selling price . This function will end up being the equation for the line of best fit for the data. Keep in mind: the line of best fit changes based on the data. Some house prices may be different in one area to those in another, so the whole goal is to find the selling price based on the training set of data.

It’s important to note that function we find won’t be perfect in all cases. The idea is to just get as close to matching these values as possible — once we have a function which can do that, we can input any arbitrary house size to get a good selling price .

To start on this, we need a way to figure out how good our current equation is. Since the equation for a line is \(y=mx + b\) (in this case, \(y\) is the selling price and \(x\) is the house size), we need to find \(m\) and \(b\) — to do that we have to start with a way to determine the accuracy of our current \(m\) and \(b\) values.

Hypothesis Function

So, we have a vector of training examples, and then a vector of \(y\) (target) values. We want the line \(y=mx + b\). Here, \(m\) is some constant which we use to multiply by our training value to get a line of best fit. \(m\) is in a vector called theta, or \(\theta\). All theta is is a vector of extra variables which we have to determine ourselves to get it to estimate the output correctly. Ideally, each of our training variables multiplied by its corresponding theta value will result in the corresponding \(y\) value.

training = [1, 2, 3]
theta = [0.5, 2, 3]
expected_values = [0.5, 4, 9]

You see in the above case, theta is exactly what we want. Each value multiplied by the corresponding theta value gives exactly the value we expected. However, you usually don’t start with the perfect theta values. You might start with each theta set to 1, for example. From there, we can train our linear regression algorithm with the training examples to tune each theta value to be more accurate.

Our equation to guess at the price for a house given a size is called the hypothesis function. It takes in our variable \(x\) and returns its best guess, \(y\). Going back to theta, it can be defined like this:

$$ h(x) = \theta_0 + \theta_1 x $$

So we have our function which sums up the our theta values multiplied by our x value, plus the y-intercept value, represented at \(\theta_0\) here.

Cost Function

As mentioned, the goal is to get our hypothesis function to match the training examples as closely as possible. We do this by finding some way to choose \(\theta0\) and \(\theta1\) as well as possible. Our \(h_\theta(x)\) function should match the target \(y\) as closely as possible.

$$ h_\theta (x) - y \approx 0 $$

This equation works for a single example, but we want to make sure that all of our training examples are matched closely. Instead, we want to do something like this:

$$ \sum\limits_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2 $$

We’re just taking the sum of the squared difference between each input (\(m\) is the number of training examples) and its expected output. Each difference is squared so that there are no negative values — adding a negative value to the sum is obviously going to be a bad time.

What we want to do is find the values for \(\theta\) and \(\theta_1\) which minimise the sum of the equation above. If we find the perfect values for theta, then the sum will be 0; there will be no difference between any of our hypothesis functions and the actual values.

That equation is called the cost function. Instead of having to write out the whole thing each time, we can define it like this:

$$ J(\theta_0, \theta_1) = \frac{1}{2m} \sum\limits_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2 $$

We’re taking the sum and averaging it, with the \(\frac{1}{2m}\). Dividing a number by 2 is the same as multiplying it by the inverse of 2, which is \(\frac{1}{2}\). Since we have \(m\) examples, the inverse is \(\frac{1}{m}\). The fraction is then divided by 2 for reasons I’m not entirely sure of yet.

So we have our cost function. We pass in the values of theta that we have so far (remember, these are arbitrary and we’re changing them to find the best fit) and get back the cost of those values. The higher the cost, the worse our hypothesis is. If the cost is 0, then we nailed it.

Calculating the Hypothesis with Vectors

So if we have a vector of training variables, and then a vector of theta values, we want to calculate the dot product. However, If we have 3 training variables, then we need 3 theta values, in addition to the \(\theta_0\) value. An easy way to get around this is to add an extra element of value \(1\) to our training examples. Then, when we calculate the dot product, \(\theta_0\) will be multiplied by \(1\).

Gradient Descent

We now have a way to evaluate our hypothesis function, and see how well it performs using the cost function. This is all well and good, but we need a way to optimize our theta values so that we find the ones which minimize our cost function.

The idea of the gradient descent algorithm is this:

  1. Start with some guesses for our theta values
  2. For each iteration, change our theta values in a way that causes \(J(\theta_0, \theta_1)\) to be minimized until we end up at a minimum

So we have our two theta values to start with. What we do is we want to look at where we are with these theta values (that is, what our cost function outputs with these theta values) and see if we can make a step somewhere to reduce the output of the cost function. If there is a direction we can go to reduce the output, we take that step, and try again. Keep going until you can’t do any better!

Here’s gradient descent, as written by people smarter than I am:

$$ \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) $$

This is basically saying: for each theta value \(\theta_j\), take \(\theta_j\) and subtract the learning rate \(\alpha\) multiplied by the partial derivative of the cost function. Because the derivative is the slope of the line tangent to the function, subtracting its value will make sure we are heading in the right direction. If the slope is positive (as in, the point at \(\theta_j\) is increasing), we subtract it to go back and find a value where it is decreasing. Similarly, if the slope is decreasing, we subtract by a negative, which means our value heads in that direction — we want it to get smaller.

Let’s walk through some of these details, starting with the learning rate, \(\alpha\). The learning rate does nothing more than control how fast our gradient descent algorithm converges; it’s completely arbitrary. We don’t need to step by just the value of the derivative each time — for large data sets that would take too long. Instead, we can speed things up by multiplying that derivative by some learning rate. You do have to be careful with the learning rate; picking a value which is too small will take a while, whereas a large value might end up skipping right over the minimum values.

You also want to be careful about how \(\theta_j\) is updated. In our case we have two theta values, so we want to make sure they are updates simultaneously. By that, I mean calculate the new \(\theta_j\) value for each theta variable, and then update them. Don’t update \(\theta_0\) and then \(\theta_1\), because then your calculation for \(\theta_1\) will be using the value for \(\theta_0\) which has already been modified.

Gradient descent pop quiz!

Say we have \(\theta_0 = 1, \theta_1 = 2\). If we update \(\theta_j\) using \(\theta_j := \theta_j + \sqrt{\theta_0 \theta_1}\), what do we get?

Well, for each one we want to calculate its value plus the square root of both theta values added. So in each case, we get \(\theta_j = \theta_j + \sqrt{2}\), since we have \(1 \cdot 2\) inside the square root. The thing here is to notice that we are using the original theta values for both calculations, not the post-calculation values.

Gradient Descent Intuition

When we want to find the derivative of a function, we have to look at see whether or not it takes more than one parameter. If we’re just dealing with one parameter, we take the derivative of that function and call it a day. However, in the case of the cost function, we never have just one parameter — there is always \(\theta_0\) and then any other theta values. This is why we take the partial derivative — we care about the derivative for some \(\theta_j\), and can ignore the other parameters.

One other thing to thing about is what the derivative is when you’re already at the local minimum. When you’re at the lowest point available, the slope of that point is going to be \(0\) — one step of gradient descent is going to do nothing, so you’re done!


  • The hypothesis function takes an input value and returns its best guess for the output. For house prices, it would take the size of a house and give us a price estimate.
  • The cost function calculates the difference between the hypothesis and actual values and sums them. The higher than value, the worse our hypothesis is.
  • Linear regression is all about figuring out the line of best fit for our data, so that we can make future predictions.

Bayes' Theorem

One of the most useful and interesting theorems in statistics is Bayes’ Theorem. I like it a lot because it can be used to solve interesting problems with very little effort — it boils down to one equation! I wanted to go over it and try to provide some intuition into how it works, and then why you would want to use it.

Conditional Probability

First, let’s walk through some simple examples to get warmed up. What is the probability that a fair coin, when flipped, will result in heads? Easy enough: 50%!

$$ P(Heads) = 0.5 $$

There are two possible outcomes, heads and tails, each with an equal probability, so the chance of getting either one is 50%. 1 actual outcome divided by 2 possible (and equal, in this case) outcomes = ½ = 0.5. This is Probability 101, but it’s worth recapping before jumping into conditional probability.

Conditional probability is where you start to think about the chances of something happening, provided something else has already happened. Let’s look at coins again, and then try something more interesting. What are the chances of getting a heads, provided you just flipped a coin and got tails? This isn’t much of a stretch from our previous work. A real statistician might write that like this:

$$ P(Heads|Tails) = P(Tails) \cdot P(Heads) $$

This is a fancy way of saying the probability of heads, given that we just got tails, is the chance of getting tails multiplied by the chance of getting heads. You have a 50% chance of getting tails, and then after that you have another 50% chance of getting heads, so the total here is 25%, or ¼.

So that makes sense. There was a 50% chance of getting a result, and then a 50% chance on top of that of getting another result, leading to 25%. This math works because the probability of a coin flip is independent. Flipping a coin and getting heads does not affect the next coin flip, despite what gamblers may try to tell you. If I flip a fair coin and get three heads in a row, what are the chances of getting heads on the next flip? It’s still 50%!

Even More Conditional Probability

The coin example is a bit boring, so I’ll try something else. Statisticians love marbles for some reason, and I’m in no position to argue, so say we have two bags of marbles. The first has 3 red marbles, and 1 blue marble. The second has 2 of each.

  • Bag 1 = 3 red marbles, 1 blue marble
  • Bag 2 = 2 red marbles, 2 blue marbles

Here’s what I want to know: if I pick out a marble out of one of these bags at random, and it is red, what are the chances it came from bag 1? Or, in fancy notation:

$$ P(Bag 1|Red) = Something $$

Given I have a red marble, this equation is the chance of the bag it came from being bag 1. Calculating this the other way around is straightforward — if you chose a marble from bag 1 and it happens to be red, the chance was ¾, so \(P(Red|Bag 1) = 0.75\). Similarly, picking a red marble from bag 2 has a chance of 50%, since there are an even number of red and blue marbles. But how do we figure out which bag we chose? This seems kinda difficult.

Bayes’ Theorem

Everything is fine. Let’s just work through this marble example, and see how far we can get. We want to know the odds of us picking bag 1, knowing that the marble we have in our hands is red. To even think about picking a marble, we had to choose a bag at random. There are two bags, and each is just as likely to be selected, so there is a 50% chance of getting bag 1: \(P(Bag1) = 0.5\).

From here, each bag has a different probability of picking a red marble, since there are different numbers of red marbles in each. We care about bag 1 right now, so as mentioned earlier, \(P(Red|Bag 1) = 0.75\). So far we have this:

$$ P(Bag1) \cdot P(Red|Bag 1) $$

Since we picked a red, we have to consider all red marbles, or \(P(Red)\). Since we are picking a few red marbles out of all possible red marbles, we can divide the previous value by the probability of red, like so:

$$ \frac{P(Bag1) \cdot P(Red|Bag1)}{P(Red)} $$

Now we have the chance of choosing bag 1, multiplied by the chance that the marble picked inside bag 1 was red, divided by the chance of picking a red marble at all.

Surprise! That’s Bayes’ Theorem! Here it is from Wikipedia:

$$ P(A|B) = \frac{P(A) \cdot P(B|A)}{P(B)} $$

That’s honestly all there is to it. Pretty neat, huh? This theorem lets us figure out the probability of an event being in some category, given that it occurred.

How Does This Even Help Me?

Alright, so maybe for some reason you’re not sorting marbles regularly. Bayes’ Theorem is equipped to handle more than that. One popular use of the theorem is for classifying email into spam or not-spam.

As I mentioned, the theorem helps us figure out the probability of an event being in some category, given that it occurred. In this case, given that we received an email, what are the odds that it’s spam?

In real life, the way we figure out whether an email is spam or not is just by reading it. If it talks about Viagra or distant relatives suddenly bequeathing large amounts of money (sometimes both?), it usually ends up in the trash. All we really have to go on is the text content of the email, so that’s exactly how these email classifiers work. In the context of Bayes’ Theorem, it looks like this:

$$ P(Spam|Words) = \frac{P(Spam) \cdot P(Words|Spam)}{P(Words)} $$

The probability of an email being spam, given that it had these words in it, is the probability of getting a spam email multiplied by the probability of seeing these words in a spam email, divided by the odds of getting these words in any email. This is why sometimes you’ll get spam email slip through your filter, and why spammers are regularly changing the language they use: it’s all about probability, so by being unpredictable, you get into a user’s inbox.

Actually building one of these classifiers is obviously a little more involved, but there is plenty of material out there to dig into. The point I wanted to get across is that Bayes’ Theorem is really cool, and can help you solve interesting problems! Learn it! Use it!

Rendering Math with KaTeX

The other day, I was looking into how easy it is to render LaTeX via a Markdown document. MathML isn’t yet widespread enough to use reliably, so instead I started looking at some of the third-party libraries available. I had looked at KaTeX from Khan Academy a few times in the past, and was happy to find that it was exactly what I wanted.

To add support, all you need is KaTeX, auto render support, and the CSS file:

<script src=""></script>
<script src=""></script>
<link rel="stylesheet" href="">

Then, in Markdown, you can write your LaTeX within $$ blocks.

a = \\sqrt{b^2 + c^3}

This renders like so:

$$ a = \sqrt{b^2 + c^3} $$

The auto render plugin can be configured to ignore certain tags or check for custom delimiters; there’s more about that in the GitHub repo. KaTeX can also take a LaTeX string and spit out formatted HTML, which is probably the better option if you’re happy to do a little more work for raw performance, but as it is, this works great.

Another option for this is using Kramdown’s LaTeX support with MathJax, but I had some issues getting it to render correctly in some cases (and it seems a bit slower than Khan Academy’s alternative).

Exploring Swift, Part 2: Installing Custom Toolchains

As you’re making changes to Swift, you’ll want to test them out by using your new version of the compiler and writing some programs with it. The most convention way to do this is to provide the system with a new version of Swift to use.

Some of the projects in the Swift ecosystem (like the package manager) may even require that you have a more recent version than that which Xcode provides. As Swift 3.0 evolves, the other projects evolve with it, and these changes can happen faster than Xcode betas are released.

What Are Toolchains Anyway?

In recent versions of Xcode, you may have seen mentions of toolchains. You might have even seen people talking about the TOOLCHAINS environment variable. It’s important to know what this is all about, since toolchains are the best way to use your own homegrown version of Swift.

A toolchain is little more than a collection of binaries. It contains the Swift compiler, LLDB, along with a few other tools and an Info.plist with some information about itself. It’s a way of containing everything you need to build a Swift project in a single directory.

The way Xcode manages these is by putting them all into Library/Developer/Toolchains and then pointing the swift-latest.xctoolchain symlink (in the same directory) at whichever one is currently active. It also uses this list to populate the Toolchains screen in Xcode’s preferences.

Building Swift

As a toolchain is just a directory with some Swift binaries in it, building a toolchain of your very own isn’t too scary. To kick things off, cd into your Swift directory and bring the project up to date:

./utils/update-build --clone-with-ssh

Before building, make sure you xcode-select to your path. The build process looks for the 10.12 SDK, which is bundled with Xcode 8. Without this, you won’t be able to build!

You should be set to build Swift now.

./utils/build-script -R --llbuild --swiftpm

Creating a toolchain is pretty straightforward. I found a great mailing list thread linking to this build script from Daniel Dunbar. As Daniel mentioned, this script was written for his personal use; I created a modified version which can be easily updated for any system.

This script can be run after compilation to build a swift-dev.toolchain directory.

Using Custom Toolchains

With your custom swift-dev toolchain in hand (it will be in the same build directory as Swift, Ninja-Release), head over to Libary/Developer/Toolchains and drop it in there. Now you can tell xcrun which version of Swift to use.

export TOOLCHAINS=swift-dev

After setting the TOOLCHAINS environment variable, the system will look in Libary/Developer/Toolchains for a corresponding version of Swift instead of using the toolchains that come packaged with Xcode. Now you’re ready to use your new compiler!

› swift --version
Swift version 3.0-dev (LLVM 505155ac3d, Clang f8606ef4b8, Swift b4cba58330)
Target: x86_64-apple-macosx10.9

Perfect! The swift binary prints the SHA for each dependency its using, so this should line up with the latest commit you built from.

← Previous Page