Side note: I’m assuming anyone reading this has a basic understanding (“YouTube Level”) of how neural networks work (a la CGP Grey).

I decided to build a fully connected feed-forward neural network, since that seemed to be the easiest thing to do.

This involves an array (vector) input, which goes “through” two matrices in order to produce a result.

Each number in the matrix corresponds to a weight or bias of a neuron.

The way I think of a matrix is as a number-modification tool. It takes some input, multiplies it by the numbers inside it, and outputs the result in a different shape. By chaining a few of these together, we have a neural network.

Take the input matrix from the MNIST database and feed it into the network. It produces a list of very wrong answers.
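A minimal sketch of that forward pass, assuming the (784, 1024) sizing from the init line later in this post; the names `b1`, `b2`, and `forward` are mine, not necessarily the original code's:

```python
import numpy as np

rng = np.random.default_rng(0)
# 784 input pixels, 1024 hidden units, 10 digit classes.
W1, b1 = 0.01 * rng.standard_normal((784, 1024)), np.zeros(1024)
W2, b2 = 0.01 * rng.standard_normal((1024, 10)), np.zeros(10)

def forward(x):
    h = np.maximum(0, x @ W1 + b1)  # first matrix, then ReLU
    return h @ W2 + b2              # second matrix: one raw score per digit

scores = forward(rng.standard_normal((5, 784)))
print(scores.shape)   # (5, 10): ten class scores for each of 5 images
```

With untrained random weights, those scores are exactly the “list of very wrong answers” described above.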

We also want to normalize this and calculate loss, which I admittedly just copy-pasted. From what I understand, the scores are exponentiated, then each one is divided by the sum of all of them. This gives nice percent probabilities that add up to 100%.

Then they take the log of everything and flip the signs. Lastly, they use a formula that people say works pretty well to calculate loss. Ideally, the lower the loss the better our model will perform. Unfortunately, this is not always the case (due to overfitting).
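The normalize-then-log recipe described above is the standard softmax plus cross-entropy loss. A sketch (a paraphrase of that standard recipe, not the post's exact code):

```python
import numpy as np

scores = np.array([[2.0, 1.0, 0.1]])   # raw scores for one example
y = np.array([0])                      # index of the correct class

# Exponentiate, then divide each score by the sum: probabilities summing to 1.
exp = np.exp(scores - scores.max(axis=1, keepdims=True))  # shift for stability
probs = exp / exp.sum(axis=1, keepdims=True)

# Take the log of the correct class's probability and flip the sign.
loss = -np.log(probs[np.arange(len(y)), y]).mean()
print(probs.sum())   # 1.0
```

The max-subtraction trick only shifts the scores before exponentiating; it doesn't change the probabilities, it just avoids overflow.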

```python
W1 = red * np.random.randn(784, 1024)
```

This is just multiplying random numbers together, so there’s nothing too interesting to see here (except regularization). What’s interesting is backprop, explained next.

First we need to calculate the gradients. I haven’t taken Calculus at school yet so I took some time to learn what derivatives are. First let’s think of it in 2D.

In short, the derivative is the slope. If you add the slope to the x value, the y value always gets bigger. If you subtract the slope from the x value it always gets smaller.

The problem here is that this is only true for linear equations. The functions we’re working with are nonlinear and have hundreds of variables (think y = (x+3)(x-4)(x-2), but much bigger). As such, we need to find the slope at a single point, and then add that to the x value.

The new issue is that oftentimes this can take us the wrong way. With a very wavy graph, adding or subtracting the full derivative can make us overshoot the local maximum or minimum and do the opposite of what it should.

Adding or subtracting 0.001 times the derivative is more likely to behave as expected. That 0.001 is called the learning rate. Smaller numbers usually give better results - but require more iterations and longer training times.
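A toy example of that step size on y = x², where the derivative is 2x (the 1.1 “bad” learning rate is just an illustrative value I picked):

```python
def grad(x):
    return 2 * x          # derivative of y = x**2

x, lr = 5.0, 0.001
for _ in range(1000):
    x -= lr * grad(x)     # step downhill by a small fraction of the slope
# x has crept most of the way from 5 toward the minimum at 0

bad = 5.0
for _ in range(10):
    bad -= 1.1 * grad(bad)  # learning rate too large: overshoots past 0
# bad bounces across the minimum and gets farther away each step
```

The small learning rate converges slowly but surely; the large one crosses the minimum every step and diverges.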

Now we just need to do that over our whole weight and bias matrix.

First, find the gradient (that means derivative) of the loss function. I’m not sure exactly why (TODO: learn Calculus to figure out why), but subtracting one and then dividing by the number of examples (the first dimension of the matrix) gives us the gradient of the loss function wrt (with respect to) the network’s answer.

(Similar to how the derivative of y=x^2 wrt x is 2x.)

I tried to use it without this step and got bad results, so it seems to be necessary. On the off chance that someone is reading this, please let me know why this works.
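For what it’s worth, the “subtract one, divide by N” step is the well-known derivative of softmax followed by cross-entropy: the gradient wrt the scores is (probs − one_hot(y)) / N. A sketch with made-up probabilities:

```python
import numpy as np

probs = np.array([[0.7, 0.2, 0.1],
                  [0.3, 0.6, 0.1]])   # softmax outputs for 2 examples
y = np.array([0, 1])                  # correct class for each example
N = len(y)

dscores = probs.copy()
dscores[np.arange(N), y] -= 1         # subtract one at the correct class...
dscores /= N                          # ...then divide by the number of examples
```

Each row sums to zero: probability mass pushed away from wrong classes goes toward the correct one.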

Anyway, then all we need to do is calculate the gradients for all the weights and biases.

The gradient of W2 is h2 (transposed) times the gradient of the loss function.

The gradient of d2 (the second layer’s bias) is just the sum, over the examples, of the gradients of the loss function.

The gradient of the ReLU is just setting the gradient to zero wherever the ReLU’s input was negative (i.e., wherever its output was zero).

… and repeat for W1 and d1. Due to something called the chain rule, the derivative of x wrt z is equal to the derivative of x wrt y times the derivative of y wrt z.

Then, in a loop, I subtracted the derivative times a small learning rate from all the weights and biases - and we have a neural network!
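Putting the steps above together, one training step might look like this. This is my own reconstruction under assumed names and sizes (784 inputs, 100 hidden units, biases called `b1`/`b2` here), not the post’s exact code:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = 0.01 * rng.standard_normal((784, 100)), np.zeros(100)
W2, b2 = 0.01 * rng.standard_normal((100, 10)), np.zeros(10)

def train_step(x, y, lr=0.001):
    global W1, b1, W2, b2
    N = len(x)
    # Forward pass: two matrices with a ReLU in between, then softmax.
    h = np.maximum(0, x @ W1 + b1)
    scores = h @ W2 + b2
    exp = np.exp(scores - scores.max(axis=1, keepdims=True))
    probs = exp / exp.sum(axis=1, keepdims=True)
    # Gradient of the loss wrt the scores: subtract one, divide by N.
    dscores = probs.copy()
    dscores[np.arange(N), y] -= 1
    dscores /= N
    # Backward pass: hidden activations (transposed) times the loss gradient;
    # bias gradients are sums over the examples.
    dW2, db2 = h.T @ dscores, dscores.sum(axis=0)
    dh = dscores @ W2.T
    dh[h <= 0] = 0                     # ReLU gradient: zero where input was <= 0
    dW1, db1 = x.T @ dh, dh.sum(axis=0)
    # Step downhill by a small learning rate.
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

train_step(rng.standard_normal((32, 784)), rng.integers(0, 10, 32))
```

Running `train_step` over shuffled batches of the real MNIST arrays for a few epochs is the whole training loop.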

97.87%. Although a far cry from the 99.79% achieved in 2013 - it’s a start.

In the future I will try to bring it as close as possible to that 99.79% result given by those researchers. After reading their paper, it may even be possible to surpass their score.

I rushed through writing this (tests, other excuses), but I want to help other people learn. If you have any questions, feel free to contact me. At the very least (in the likely event I don’t know) I’ll try to find you a resource where you can find out.

I managed to break one of the rules I set for myself already! People who have taken cs231n will notice that I followed that tutorial. Unlike most tutorials - I feel like I actually now have a basic understanding of the topic.

I realize that I shouldn’t have boxed myself out of large learning opportunities like that one. It’s probably better to keep an open mind, especially when trying to learn something new.

I’m thinking I’ll take a break from neural networks and talk about something else instead in the next post.

`helpers.py` files which obfuscate simple tasks. I want to try to use my ability to search and solve problems to get around that instead.

That said, here are a couple of rules that I made for myself in regards to learning about neural networks.

- No “easy” tutorials or classes. Although easy to complete, I usually learn very little from them.
- Documentation and academic papers are not only allowed, but recommended. Everything new comes in the form of code, documentation, or an academic paper, and I want to be able to read those.
- Python packages are also allowed, as long as they do not do what I’m trying to learn about.

In this post, I use a prebuilt implementation of a random forest classifier to get a baseline. This post is me getting my feet wet with the Python data science tooling.

First things first, find a small dataset to practice with.

After looking around for a while I found the MNIST dataset. It contains pictures of handwritten digits for character recognition. It’s also less than 100 MB, and as such will easily fit into the RAM of any modern computer.

The dataset can be downloaded on Yann LeCun’s Website.

One thing to note is that `wget` automatically unzips the gzip files while leaving the extension (a byproduct of how the web works). As listed on the site, all I needed to do to get it to work was to remove the `.gz` file extension.

The dataset format is fairly basic and is described on Yann LeCun’s Website. Unfortunately, for a high level language user like me, it was not immediately obvious how I could load this into Python to make it useful.

One interesting thing I noticed was that this dataset is big-endian, while most modern processors are little-endian. In practice, this doesn’t cause much of a problem for someone using Python.
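For the curious, the byte order does matter if you read the file headers by hand. A sketch using the standard `struct` module (the header bytes here are the magic number the MNIST image file actually begins with):

```python
import struct

# The MNIST image file starts with the magic number 2051, stored big-endian.
header = bytes([0x00, 0x00, 0x08, 0x03])
(magic,) = struct.unpack(">i", header)   # '>' = big-endian, 'i' = 4-byte int
(wrong,) = struct.unpack("<i", header)   # little-endian read gives garbage
print(magic)   # 2051
```

Libraries like `idx2numpy` handle this for you, which is why it “doesn’t cause much of a problem” in Python.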

I looked up the IDX file format and found `idx2numpy`, a package on PyPI that makes loading the dataset as easy as `idx2numpy.convert_from_file("train-images-idx3-ubyte")`.

```python
train_data = idx2numpy.convert_from_file("train-images-idx3-ubyte").reshape(60000, 784)
```

Note: The `.reshape()`s become necessary later, as I will explain.

Getting a good look at how the dataset is structured seems like a good idea. To do so, I did a `.shape` on all of my data.

Note: I am using a Jupyter notebook, so I don’t need to print my results. The results are shown within the notebook.

```python
train_data.shape  # before reshaping: (60000, 28, 28)
```

Okay, seems pretty simple. We have 60000 images, with each image being 28 pixels by 28 pixels.

Okay, time to do some simple analysis with scikit-learn and a Random Forest Classifier (herein, RFC).

```python
from sklearn.ensemble import RandomForestClassifier
```

Oops, without the `.reshape()` call added on, it doesn’t work. A quick look at the scikit-learn documentation explains why: it expects 2D data in the shape (n_samples, n_features). I thought about it for a while and figured that, due to how RFCs work, it wouldn’t matter if I flattened the 3D data into 2D. NumPy makes this easy; all I needed to do was reshape to (60000, 28*28) - and I have 2D data.

We get a 57% baseline for ~10 lines of Python. Not bad compared to the 10% accuracy given by random guessing.
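The whole baseline really is about ten lines. A sketch using a small random stand-in shaped like MNIST so it runs anywhere (swap in the real `idx2numpy` arrays for the actual experiment; `n_estimators=10` is my assumption, not necessarily the original setting):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Stand-in data shaped like flattened MNIST: 784 features, labels 0-9.
X_train, y_train = rng.random((500, 784)), rng.integers(0, 10, 500)
X_test, y_test = rng.random((100, 784)), rng.integers(0, 10, 100)

clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X_train, y_train)
score = clf.score(X_test, y_test)  # fraction of test digits classified correctly
```

On this random stand-in the score hovers near the 10% chance level; on the real digits the same code does far better.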

The end goal of this is for me to implement a neural network without using high level libraries like TensorFlow or Theano that gets above 90% accuracy. The current state of the art is around 99.79% accuracy, and reading the academic paper once I’ve learned a bit about neural networks could help me get there as well.

Wrong.

I’ve made a few blogs over the years, but never posted past my introductory post. Or, if I did post, I considered the post to be not good enough and deleted it.

The choice of blogging software was also painful. WordPress is heavy and bloated, and none of the cool features or plugins I wanted are free.

So, what is my solution?

A static site generator (SSG), as the name implies, generates static sites. This means that all pages are created at build time on my machine, and all the server has to do is send pages to readers.

As staticgen.com writes on their about page:

> The typical CMS driven website works by building each page on-demand, fetching content from a database and running it through a template engine. This means each page is assembled from templates and content on each request to the server.
>
> For most sites this is completely unnecessary overhead and only adds complexity, performance problems and security issues. After all, by far the most websites only change when the content authors or their design team makes changes.

Thanks to the aforementioned website, I chose Hexo.

What methodology did I use to pick an SSG? First of all, I went by theme support. SSGs with many available themes are probably more likely to have good theming support. Some SSG project pages don’t even mention themes!

I went in wanting to look somewhat like Kevin Kwok’s Blog. In my opinion, it has just the right level of simplicity and content. It feels busy, but not overwhelmingly so. It makes me want to read the posts (a very good thing).

I eventually realized I don’t have enough content to make that work, so I settled for a simple minimal theme instead.

This narrowed me down to Hexo and Hugo, both of which are in the top three on staticgen.com.

Next up, I looked at the source. When something goes wrong (when, not if), being able to understand and fix it matters a lot to me. Hugo’s source slightly confused me. Hexo has a standard layout for all its files, and I felt like I could understand it if I needed to.

So it was decided, Hexo it is.

Bonus: The formats for Hexo and Hugo posts are largely the same. If I want, I should be able to move from one to the other without much hassle.

So that’s the first problem solved: I have a simple medium to publish on. The second problem was not publishing articles at all (or often enough).

I decided to remedy this by simply doing more “cool things” that I can write about, and being sure to write down every idea I have for what to post.

Here’s a few things I want to write about.

I am currently working through Hacker’s Guide to Neural Networks by Andrej Karpathy. Unfortunately, he abandoned the excellent guide (the first one to help me understand backpropagation and derivatives). It is also a bit wordy, so one of the things I could do is try to explain his concepts on my own.

The benefit to this is twofold. One, it helps me learn. I learn very well by teaching others. Two, it may help others learn. I hope I can help someone else learn too!

Sometimes I get sent a link, or I find one browsing around, and I find something interesting that I want to share with everyone. The fact that I can’t think of any right now means that they may not be as interesting as I think they are when I find them.

Simply put, connecting things to other things in order to do interesting/useful things? I may write about my DIY IoT project if I ever rehaul it.

How did (blank) get popular? Why is (blank) done in one language, but not in its parent languages? How is (song) using (thing) for (effect)? Sometimes I think about things like this, it could be fun to write it down.

And that’s it for today!
