"The disk is slow" is one of those things that most programmers take for granted. Yes it is slow given the speed of other components. But rarely have programmers taken the time to dig into WHY the disk is slow and what that actually means. Yet, doing so can lead us down some interesting rabbit holes.

What is slow?

For a long time now, the speed of a hard-drive has been measured in RPM, or Revolutions Per Minute. This is an indication of how quickly the disk can spin. It is common nowadays to see drives advertising 7,200 rpm, 10,000 rpm or even 15,000 rpm. How fancy.

Now, the disk itself is split into a couple major components.

  1. The disks (platters) that data is stored on
  2. The read/write head

These disks are where the data is actually stored, and when you see a number like "7,200 rpm" what you are seeing is how quickly these disks can spin. In a simplified manner, what happens when you "write something to disk" is that the disk spins to an empty point, and the head begins to write. Likewise, when you "read" data from the disk, it spins to a designated point (the "start" of the data) and the head "reads" the data until it is done.

Let's walk through a theoretical "disk". Say your "disk" can hold 8 units (00000000) of storage. We are going to perform a few actions on this disk.

  1. You write 'a' twice - aa000000
  2. You write 'b' 3 times - aabbb000
  3. You delete the two 'a' - 00bbb000
  4. You write 'c' 4 times - ccbbbcc0

See, your drive is smart enough to know that, even though there isn't enough "contiguous" space, there is still enough space scattered around the drive to store your 4 units of 'c'. What happens is that your drive will spin to a free location and let you start writing. When you run out of contiguous space, it will spin to a new location. This results in your data actually being spread out all over your drive instead of sitting next to each other. This is a good thing, of course. It means that you can use all the space on your drive without worrying about WHERE things are stored.
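
Here's a toy sketch of that scenario in PHP (purely illustrative; a real drive and filesystem track free space with far more sophisticated structures), just to show how a "write into the first free slots you find" strategy scatters data:

// A hypothetical 8-slot "disk", purely for illustration.
$disk = array_fill(0, 8, '0');

// Write $count units of $value into the first free slots we find.
function writeUnits(array &$disk, string $value, int $count): void {
    foreach ($disk as $i => $slot) {
        if ($count === 0) {
            break;
        }
        if ($slot === '0') {
            $disk[$i] = $value;
            $count--;
        }
    }
}

// Delete every unit of $value, leaving holes behind.
function deleteUnits(array &$disk, string $value): void {
    foreach ($disk as $i => $slot) {
        if ($slot === $value) {
            $disk[$i] = '0';
        }
    }
}

writeUnits($disk, 'a', 2);  // aa000000
writeUnits($disk, 'b', 3);  // aabbb000
deleteUnits($disk, 'a');    // 00bbb000
writeUnits($disk, 'c', 4);  // ccbbbcc0 -- the c's end up in two separate regions

echo implode('', $disk) . "\n";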

But, it also means that instead of the disk spinning just ONCE to get to the start of your data, it actually needs to spin twice.

As we grow our disk size from "8 units" to the hundreds of gigabytes that most modern drives have, we run into a problem - there is no guarantee that the data we need will be next to each other. In fact, there is a high probability that we will need to keep jumping around on the disk to be able to read ALL the data we want. Our data ends up "fragmented".

Data Fragmentation

The result of all this fragmentation is that things just get slower over time. Unfortunately, this gradual degradation of storage efficiency is never attributed to the hard-drive, because users don't actually USE the hard-drive directly - they go through the OS, which is supposed to manage these things. As a result, the experience of degrading performance is chalked up to "my Windows is slow". Operating systems combatted this eventual degradation by shipping with a defragmenter, which does exactly what you'd expect. It takes all these scattered fragments from around your drive and puts them next to each other. This reduces the overall number of seeks necessary to retrieve information, thereby making things speedier.

But that's an expensive (resource wise) thing to do. In order to defragment a system, the program needs to

  • find an application that has its data fragmented
  • copy the data between the fragmented data to memory or some other free space on the drive
  • move the data closer together.
  • repeat

Let's go back to our previous scenario and see how defragmentation could work:

Initial State: ccbbbcc0  
Step 1:        cc0bbccb  
Step 2:        cccbbc0b  
Step 3:        ccc0bcbb  
Step 4:        ccccb0bb  
Step 5:        cccc0bbb  

Obviously this is not optimized in any way, so there's plenty we could do to speed this up. But this is essentially what your drive is doing. It's like putting a deck of cards back in order after you've been shuffling them. Sure, it's possible, but it just takes some time.
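
Sticking with the toy PHP disk from above (again, just an illustration), the end state of that walkthrough is roughly what you get by regrouping each "file". The sketch below cheats by rebuilding the layout from scratch; a real defragmenter has to shuffle data in place using whatever free space it can find, which is exactly why it's so expensive:

// Group every unit that belongs to the same "file" (same letter) together.
function defragment(array $disk): array {
    $data = array_filter($disk, function ($slot) {
        return $slot !== '0';
    });
    $counts = array_count_values($data);          // e.g. ['c' => 4, 'b' => 3]

    $result = [];
    foreach ($counts as $value => $count) {
        for ($i = 0; $i < $count; $i++) {
            $result[] = (string) $value;
        }
    }
    return array_pad($result, count($disk), '0'); // free space ends up at the tail
}

echo implode('', defragment(['c','c','b','b','b','c','c','0'])) . "\n"; // ccccbbb0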

A much better idea would be to try and optimize STORING this data in a way that reduces fragmentation. That is, maybe we keep data that is related next to each other on the drive when we WRITE the data the first time. That way things don't get as fragmented as quickly.

Blocks and Pages

The first step in ensuring data is kept close together is the idea of "blocks". Basically, the filesystem that actually interacts with the hard-drive defines a "block size". The block size is a measure of how much data will fit in a block, and the filesystem reads/writes in blocks instead of individual bytes. Think of it this way: if your hard-drive was a piece of lined paper, we were originally writing things down one word per line. With blocks, we basically said "well, we'll just write until we reach the horizontal end of this line". So now instead of one word per line, we have a few words per line. Perhaps, we could say, we have one sentence per line.

Using our previous 8 unit drive example, we could sub-divide that into blocks of 2 units, making it look like this:

Drive State: [c,c][c,c][b,b][b,0]  

Now when we want to read all the c values, we have two seeks instead of one per record. This is already a big improvement (we've reduced seeks by 50%), but we could probably reduce it even more. Since the filesystem has to expose a standard block size to all applications, systems that do a high amount of HDD I/O need an alternative. The easiest thing to do is take the concept of a block containing records and create another abstraction: a page containing blocks.

At it's "lowest" level, a relational database deals with "pages". Pages are really just collections of the data that you are storing. Relational databases (non-relational databases might as well, but I haven't really dug into the internals of a lot of them) utilize this concept of a "page" to further decrease IO latency with the disk. Rather than dealing with the storage of individual records or information, it groups records together into a "page" and uses that. It will read/write a whole page. This allows them to capitalize on the assumption that when you are reading/writing data the data you are accessing is probably next to other data that you also require.

They even go so far as to let you customize this via "clustered keys". A clustered key is just a mechanism to allow you, the database administrator, to define HOW the database orders the data within pages. As the administrator, you know the data you are trying to store, and the primary ways that it might be accessed. Databases give you the ability to say "well, group all these records together on the disk by the values in this column". This creates pages that are grouped around a particular value (a userID for example), so that all records with that same value are near each other.

Think of a database where you want to associate a list of items with a user. You have two tables, users and items.

+-------+    +---------+    
| users |    |  items  |    
+-------+    +---------+        
| id    |    | id      | 
| name  |    | user_id | 
+-------+    | name    | 
             +---------+    

It would make sense to create a clustered key around the user_id in the items table. This allows us to keep all items that belong to a single user in the same page, or group of pages, on the disk. This way, when we try to retrieve the items for a user, the database management system can fetch all the pages related to that user, stick them in memory, operate on them, and then write them all back.
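
As a concrete sketch (a hypothetical schema, with MySQL/InnoDB assumed here because InnoDB clusters rows by their primary key), the items table might be declared like this:

-- Leading the primary key with user_id keeps each user's items
-- physically grouped in the same page(s) on disk.
CREATE TABLE items (
    id      INT UNSIGNED NOT NULL,
    user_id INT UNSIGNED NOT NULL,
    name    VARCHAR(255) NOT NULL,
    PRIMARY KEY (user_id, id)
) ENGINE=InnoDB;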

Databases are very intricate systems, and I don't want you leaving here thinking this is all there is to the concept. This is a HUGE simplification of what the database is actually doing, but it should give you an understanding of why it does some of this at the storage level.

The problem with blocks

The block system, however, is not without its own problems. By using "blocks" we've introduced a bit of wasted space into our storage. Let's go back to our block example:

Drive State: [c,c][c,c][b,b][b,0]  

That trailing 0 in block 4 will remain empty unless we add more b data. We will be unable to add any more c values, but we will be able to add more b. In fact, your drive will appear full to you, because at the operating system level it has no idea about the intricacies of your data storage. It just knows that these blocks are in use. So your 8 unit drive has suddenly become a 7 unit drive.

That kind of sucks, and is actually a fundamental problem with "blocks". As long as you have data to write, blocks are great, but they will almost always result in the LAST block in a segment not being completely filled. This is natural, of course, since whatever application is using that space generally doesn't care to know (nor should it!) about the block size it needs to be using. The result is that the more "small files" you have on your drive, the more "slack space" you have - space that isn't being used for anything, but is still seen as "used".
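
Here's a quick back-of-the-envelope calculation of that slack space, sticking with PHP (the 4,096 byte block size is just a common default, assumed for the example):

// Toy slack-space calculation for a single small file.
$blockSize = 4096;  // bytes per block (a common filesystem default)
$fileSize  = 10;    // a tiny 10-byte text file

$blocksUsed = (int) ceil($fileSize / $blockSize);    // 1 block
$slack      = $blocksUsed * $blockSize - $fileSize;  // 4086 bytes "used" but wasted

echo "$blocksUsed block(s) used, $slack bytes of slack\n";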

So now we come to a decision: either we just leave that space empty and accept it as part of the operating costs, or we try to figure out how to utilize it. Engineers, un/fortunately, are quite obsessed with performance. These "tails" (the last block) are inefficient, and could probably be dealt with by a bit of smart thinking. This leads to two possible ways to resolve the problem.

  1. We allow the filesystem to support variable block sizes
  2. We figure out how to use that tail block for something useful

The first way, variable block sizes, is something file systems like ZFS utilize in an attempt at more efficient storage. Since you know the kind of data you will be using the drive for, ZFS lets you specify your block size. If you know you have a lot of small things that need to be stored, drop the block size; likewise, increase it if you have large things. It even has some magic like block-level compression to try and use those blocks to their fullest. It is a very simple idea - and as we know, the simple ideas are the hardest to implement!
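
For example, assuming a hypothetical ZFS dataset named tank/data (the name is made up for illustration), tuning the record size and turning on compression looks something like this:

zfs set recordsize=16K tank/data
zfs set compression=lz4 tank/data
zfs get recordsize,compression tank/data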

The second way is another simple solution to the problem. If we know we have a bunch of tail blocks that are half-filled.. why don't we just combine them? That way we aren't creating a new tail block, but are instead re-using another tail block. This would result in another seek to read/write this data, but it ensures that we are using the disk to its fullest capacity. File systems like BTRFS will combine multiple tail blocks this way. The reason this is so effective is that the typical block size is some multiple of 512 bytes. If you think about it, a text file might only be a couple of bytes. In a traditional file system that's one block for a couple of bytes. That's a heck of a lot of things you can stuff into a single block at that rate!

Changing the game

As you can see, we've put a lot of work into reducing the seek time for hard-drives. They've been such a fundamental component of computing that it was a requirement. But what if you could just ignore seeking entirely? What if there was a way to seek almost instantly? In computer science we refer to this as O(1). That is, the size of the data we are looking through is irrelevant - we can access any section of the data as quickly as any other. Welcome to the world of solid state storage. Solid state storage utilizes electronics instead of mechanical instrumentation. That is, instead of a spinning disk and actuators for the read/write heads, it uses electrical circuits. By removing the mechanical parts, it eliminates the "seek" time of disks that we find so slow. The only problem was that it was expensive and hard to make ENOUGH storage this way. We could easily make HDDs that held several gigabytes while still struggling to make solid state drives that held megabytes. Solid state just couldn't keep up.

Until it could.

Nowadays, solid state storage devices are relatively cheap and large enough for the average user. They eliminate a whole bunch of problems caused by mechanical components. They produce less heat, fewer vibrations, and they are a lot faster. In fact, for a lot of workloads it's silly to rely on hard-drives when you can get so much better performance from solid state storage.

How interesting

It's crazy to think of all that we've accomplished because of that little mechanical hard-drive. But what's crazier is that we are only able to see this in retrospect. No one was able to see what the result of spinning disk drives would be. No one thought that we would invent so many different file systems to solve these problems, or that we would make so many advancements in technology just to store MORE data on the drives. At the time, they were just better than tape. They were simply a step in the chain that, in retrospect, was pretty cool.

Follow the conversation at HackerNews https://news.ycombinator.com/item?id=13091192


Anachronistic Programming

I've decided to repost this article that I originally posted on September 17, 2013 while cleaning up the blog archives from when I was using Fargo.

I want to show you a piece of code. Something that's touted whenever this language is spoken of, and everyone seems to be able to pull out of their ass. I want you to look at it, understand it and then see how far down the execution chain you can take it. I'm not talking about if you can debug the app, set a breakpoint and then step through it. I want you to sit there, hands off the keyboard and go through the request cycle that needs to occur for this particular piece of code to execute and run as you expect it to. Spare no detail.

Did you get far enough? Did you get down to the HTTP protocol? The packet dance that happens before the actual request from a browser is sent? Or did you stop at "Browser makes a request to the IP address"?

See, the problem today isn't that the computer is this magic box that only a few can understand. It isn't relegated to the guys with beards and thick glasses. It isn't just for the geeks and nerds. It used to be, though. To be a programmer meant that you needed to also understand the specific hardware stack that you were working on. The exact chipset, the exact instructions available to you. The exact specs of the memory and video controllers.

Today, computers have become commonplace. And in order for them to get to this stage, a few things needed to happen. The first one being "It just works". That's the basis of the consumer computer. With no fiddling, no worrying about any kind of internals, you should be able to get up and running in no time.

But you're not just a typical consumer. You're a programmer. And unfortunately, this idea of "It just works" has found its way into programmers' minds everywhere. You don't need to think about the HTTP protocol, "It just works". You don't need to worry about Little vs Big Endian - "It just works".

Until it doesn't.

I think the problem with programmers today is quite simple. We've been led to believe that we can rely on certain things within the system. Which is great. I mean, if we couldn't rely on the HTTP protocol, where would we be today? But this reliance has led to an entirely new problem - "I don't care". Programmers today don't need to learn how the protocol works because "It just works". They don't need to think about it. And I think that has led to an entire generation of programmers who don't understand the fundamentals of programming. They don't understand that the code they type into their fancy IDEs is really powered by the ideas of a few people and run on hardware. There's a severe disconnect between hardware and software, and it is hindering them without them knowing it.

Don't get me wrong, I'm not trying to say that I'm some incredible programmer - far from it. I'm actually a terrible programmer, because programming isn't all software and algorithms. There's a hardware component to it that's overlooked way too often. The things you're doing with code you're RELYING on the hardware to accomplish.

Don't you think you should at least have a vague understanding of how it works?


Becoming a web developer

The web is a big deal. Like, a HUGE deal. And the people that make the web have been thrown from their basements into the limelight. They've been lauded and applauded for being at the forefront of technological innovation. But everything has its cost. Jonathon Hill posted today about the cost of being a web developer. I think Jonathon missed an important cost - he forgot what it's like to be starting out. It's a common problem, and one that inevitably plagues even the great ones. I don't mean to belittle what Mr. Hill does for a living. While I don't have a lot of experience with his work, I'm sure he's a great web developer. And I'm sure that his work is incredible because of the tools he has.

What I take issue with is that he seems to believe that this is what a new developer requires to be great. On the contrary, I think having tools like this at your disposal from the beginning causes one of two things.

  1. You start, the tools don't confer god-like web development skills. You get upset and leave.
  2. You start, the tools don't confer god-like web development skills. You get upset and work harder.

One of those two is good - and that same one doesn't require the initial investment that Mr. Hill thinks.

The real cost

  • Laptop (Hey, you probably have one of these right now!) ~ $700
  • Books ~ $500*
  • Linux (this one is optional, but if you're new to development I'd recommend it. There are a lot of great tools and utilities out there that show up on linux first. They won't be pretty, but god dammit they'll be awesome.)

*optional

Total Startup Cost: ~ $1,200

The books, of course, are not required but eventually you'll find that there are some things that people will always refer to that you'd like to have around. Javascript: The Good Parts for example. Or the Dragon Book. Books are very important.

The beauty of the web is that it isn't memory intensive when you're trying to figure out what you're doing. But it can grow to whatever you want.

What about training?

Technology moves fast. Really fast. You know that awesome new phone you got two months ago? Out of date. You know that great new framework you learned last year? Technology has made that irrelevant. I will agree with Mr. Hill to a certain degree here. A technology focused college education is quite the waste of time. However, I don't recommend skipping it right away. There are other things that a college education offers you apart from your program.

Presentations, working with others, taking charge of projects when you end up with a bunch of slackers. Making the tough decision to kick that one dude out of your group because he does nothing. Essays, reports, being on time. And the most important one for a FREELANCE WEB DEVELOPER, WHICH MR. HILL SEEMS TO MISS: MANAGING YOUR TIME. For many people, college is the first time they're left to their own devices. They are responsible for themselves and they have to figure out how they work best, and how to manage their social lives AND their work lives. People are given 4 years to make this work. Four years when you're allowed to screw things up and start over. Because the thing is, once those 4 years are up, if you don't have some understanding of how to be you - you're pretty fucked.

However, don't waste your education on a technological degree. Instead, I'd recommend doing something unrelated. Psychology, Marketing, or even English/Theater.

See, there's a weird stereotype about a lot of tech people - they tend to be rather introverted. This isn't true for everyone of course, but for those for whom it is, you have to understand that even when you enter the workforce as a developer, you still need to interact with people. A LOT.

You have meetings and phone calls, you have to explain your choices to management and clients. If you're a freelancer, you have even MORE work. You need to be a sales guy, support staff and a developer. If you find it hard to talk to people - good luck.

It's a great time to be a good developer.
~ Jonathon Hill

How true it is, Mr. Hill. It is a great time to be a good developer. But starting out with a $3,000 investment doesn't make you a good developer.

Having the drive to be better makes you a good developer.


Again, I feel like I have to point out - I think Mr. Hill has some great work. Browsing through his projects, I'm not claiming he doesn't know what he's talking about. Just that maybe he's forgotten what it's like to start out as a developer.


Git core.autocrlf

Linux is my development environment of choice. It wasn't always - I used to do all of my work with XAMPP and Windows, but eventually I got sick of waiting for cool things and just made the jump. Now I can't imagine actually getting any work done NOT in a terminal.

Vagrant and VirtualBox allowed me to have my linux environment for work and my Windows one for gaming. However, it means I get to run into a few issues I never had before - namely, constant end-of-line issues with git.

Windows uses the CRLF (Carriage Return/Line Feed, \r\n) ending for lines, whereas unix uses just LF (Line Feed, \n). This generally means that there's a whole mess of ^M characters in vim due to some files having the dos ending and some being unix-y. Git will try to correct this automatically, but for the longest time I had no idea how to actually set that up.

core.autocrlf

Git has a configuration setting called core.autocrlf that tells git how you want it to handle line endings. There are three options:

  • true
  • false
  • input

true

Essentially turns on autocrlf, which means that when you check files out git converts line endings to what your system expects (CRLF on Windows), and converts them back to LF when you check them in.

false

Turns it off (obviously). This essentially leaves your line endings alone.

input

Converts line endings to unix-y LF when you check files in, but leaves them alone on checkout.

In the end I settled on git config core.autocrlf true for my case. It modifies the line endings for dos based files to unix-y and then leaves me alone. Exactly the kind of option I'm looking for.
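
For reference, checking and changing the setting is just a couple of commands (drop --global to configure a single repository instead):

git config --get core.autocrlf
git config --global core.autocrlf true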


Making vim your IDE

I freaking love vim. Every so often I spend a couple weeks checking out another IDE to make sure I'm not missing anything and then I inevitably end up back at vim. Over time I've assembled a few plugins and workflow that make vim perfect for me.

pathogen

Pathogen is a plugin manager for vim that utilizes git to keep plugins organized. Once you add pathogen to your vim install, you can easily add new plugins simply by cloning the repo. There isn't much to say about pathogen except that it is the greatest.
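
If you've never set it up before, the install is roughly the following (the curl URL and paths are pathogen's usual ones, quoted from memory, so double-check against the pathogen README):

mkdir -p ~/.vim/autoload ~/.vim/bundle
curl -LSso ~/.vim/autoload/pathogen.vim https://tpo.pe/pathogen.vim
# then add "execute pathogen#infect()" to your ~/.vimrc
# and install plugins by cloning them into ~/.vim/bundle/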

NERDTree + NERDTreeTabs

The first time you pop open vim you'll notice that you can't really browse around files. You need to know the filename that you're attempting to open, as well as the directory structure. While you can get around this with some tmux file listing, it would be a heck of a lot easier if vim just had a file listing like a normal IDE. NERDTree does just that - a list of directories and files that you can navigate through with your keyboard.

NERDTreeTabs is an addon for NERDTree that allows you to have the same NERDTree instance open across multiple vim tabs.

tagbar

Tagbar is something that I don't actually use that often - however, it's one that I'm always glad is around when I need it. It gives you the code structure of a file. Classes, methods and variables are all listed, allowing you to easily jump to definitions and get an overview of what content is in a file.

Normally when I open a file I haven't been to in a while I'll pop open the tagbar and quickly glance at the methods to re-acquaint myself with things before I close it and get back to work. You can also jump straight to definitions by navigating around the tagbar, but I rarely end up doing that. There is a weird issue with having multiple buffers open simultaneously (I use a lot of vertical splits with vim) where you can't navigate to the tagbar as it displays the tags of the last split that you were in. Therefore I can't really navigate over to the tagbar from a split that isn't directly next to it. Technically I could move the splits around, but I feel like that's more trouble than it's worth.

ctrlp

Apart from NERDTree, ctrlp is the other way I navigate files in vim. Often I know the file that I want to work on.. but it's buried within a huge directory structure that I don't really feel like navigating or typing out. Instead, ctrlp performs a fuzzy search over the current directory and lets me open the file within a few keystrokes.

tmux

My development environment has always been linux - namely a VirtualBox/Vagrant VM that I SSH in to. This means I get a straight terminal view with minimal colors and no real "windows". In order to do a bunch of things at once, I use tmux, which is a terminal multiplexer. Essentially, it allows me to replicate having multiple terminal tabs/windows within the same session. It even has the added benefit that if I get disconnected from the server I can reconnect and just re-attach my previous session. While tmux isn't actually a plugin for vim, I find using vim with tmux is a great deal better than vim alone. I get to have a tmux window for code, for git, for mysql and for logs.

Install tmux: sudo apt-get install tmux
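
Day-to-day, a handful of commands cover most of it (the session name is arbitrary; Ctrl-b is tmux's default prefix key):

tmux new -s dev         # start a new named session
tmux ls                 # list running sessions
tmux attach -t dev      # re-attach, even after a dropped SSH connection
# detach from inside a session with Ctrl-b then d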

dotfiles

Now that I've kinda got my development environment customized how I like, I want to have it with me no matter where I am. I develop on multiple vms on two different machines, as well as multiple remote machines. I'd prefer if my environment was always the same. In order to ensure that I can get up and running as soon as possible on any new machine I run the following commands:

sudo apt-get install git tmux curl
mkdir ~/personal/
cd !$
git clone http://github.com/angelor/dotfiles
cd dotfiles
./setup

There aren't that many customizations in my .tmux.conf and .vimrc files. Mostly it's to set up plugins and to remap some keys for tmux.


API Logging with Graylog2 - PHP Logging

This is Part 2 of the API Logging with Graylog2 series. View Part 1

Now that you have the backend components configured for logging, it's time to set up and configure GELF - the Graylog Extended Log Format.

Step 1: Composer

GELF support for PHP is only available via Composer. The installation instructions are pretty straightforward, so I won't attempt to go into too much detail - Composer does an excellent job of covering the basics. Composer Installation Instructions.

Once that's done and set up, you'll need to set up your composer.json file as follows:

{
    "require": {
        "graylog2/gelf-php": "0.1.*"
    }
}

Then just run composer install or composer update if you already had a composer.json file.

What this will do is grab the gelf-php libs and toss them into a ./vendor/ directory wherever the composer.json file exists. It will also configure an autoloader so that you don't have to figure out what files to include.

Step 2: Log

Now that we've got the library where we want it, we can go ahead and start the logging!

// Pull in Composer's autoloader so the Gelf classes can be found
require_once __DIR__ . '/vendor/autoload.php';

// Send messages to the GELF/UDP input on the Graylog2 server
$transport = new Gelf\Transport\UdpTransport('127.0.0.1');
$publisher = new Gelf\Publisher();
$publisher->addTransport($transport);

// Build a minimal message and ship it off
$message = new Gelf\Message();
$message->setShortMessage('some log message')
        ->setLevel(6);

$publisher->publish($message);

That's all there is to logging to Graylog2. However, there are a lot more things that you can add to your message to give your log a bit more substance.

Customizing your message attributes

One of the things that isn't really documented very well (with the library at least) is what exactly can go into the message. The Message object includes a few additional methods that you can use to get the most out of Graylog2 - there's a short example after the list below.

  • setShortMessage - Just a short descriptive message about the log.
  • setFullMessage - This is where you could include any backtraces or additional dumps.
  • setLevel - The "severity" of the log. It follows the standard syslog levels
  • setAdditional - This method accepts two args, the first being a custom key, the second being the value. This is a neat way of adding new information to your log (API keys for example).
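
Putting those together, a fuller message might look something like this (the api_key and execution_time values are made up for illustration):

$message = new Gelf\Message();
$message->setShortMessage('API request handled')
        ->setLevel(6);  // informational

// extra context that you can search and chart on later
$message->setFullMessage('GET /v1/users/42 returned 200 in 0.042s');
$message->setAdditional('api_key', 'abc123');
$message->setAdditional('execution_time', 0.042);

$publisher->publish($message);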

Personally, I think Graylog2 is a phenomenal way to build a proper logging system - something that is often overlooked when you're in the "app dev" phase. I've talked about planning before, when I talked about Lines of Code as a metric, and logging is definitely one of the most easily overlooked features - one that's super easy to add from the beginning. Logging provides you not just a way to track errors, but also a way to track progress. Imagine tracking your API usage with Graylog2 and watching the requests/hour steadily rise. And then, because you thought about logging from the beginning, you can easily display the additional attributes "api_key" and "execution_time" that you've been logging to keep a better eye on your server.


Lines of Code

Lately there's been talk again about "Lines of Code" as a metric for progress on a project. The idea is that developers are graded based on how many lines of code they check in on a daily/weekly timeline. Unfortunately, not only is this unbelievably stupid, it's also not "new" stupid. It's the kind of stupid that's just been around for a while.

In early 1982, the Lisa software team was trying to buckle down for the big push to ship the software within the next six months. Some of the managers decided that it would be a good idea to track the progress of each individual engineer in terms of the amount of code that they wrote from week to week. They devised a form that each engineer was required to submit every Friday, which included a field for the number of lines of code that were written that week.

Bill Atkinson, the author of Quickdraw and the main user interface designer, who was by far the most important Lisa implementor, thought that lines of code was a silly measure of software productivity. He thought his goal was to write as small and fast a program as possible, and that the lines of code metric only encouraged writing sloppy, bloated, broken code.

He recently was working on optimizing Quickdraw's region calculation machinery, and had completely rewritten the region engine using a simpler, more general algorithm which, after some tweaking, made region operations almost six times faster. As a by-product, the rewrite also saved around 2,000 lines of code.

He was just putting the finishing touches on the optimization when it was time to fill out the management form for the first time. When he got to the lines of code part, he thought about it for a second, and then wrote in the number: -2000.

I'm not sure how the managers reacted to that, but I do know that after a couple more weeks, they stopped asking Bill to fill out the form, and he gladly complied.

Interestingly enough, this silly idea is back yet again. A recent post over on /r/programming talks about a teacher who grades based on LoC.

Senior developers write less code

Here's the thing: I believe that the newer you are to programming, the more time you spend programming. Think about your early projects - you had an idea and you just jumped right into it, didn't you? You knew exactly what you wanted to do, exactly what features you wanted, and you just hacked at it until it was done. You hit that Builder's High. When you're new to programming you chase that high as fervently as possible. You've never tasted anything quite so sweet.

Unfortunately, the time eventually comes to make some updates to your software. And you think "You know what, if I just re-write it with what I know now.. it'll be like 10x better!". And off you set yet again, chasing that high. The thing is, eventually you get sick of building the same thing over and over again and you eventually decide to just build it smart the first time.

So you sit down and plan out the requirements, you organize the database, and you organize your folder structure. You set up npm, Grunt and LESS. And before you know it, your day is done without a single line of code being written. But you feel productive - because you're no longer a code-monkey hacking out LoC.

Not a builder, an architect

See the difference between a Jr. dev and a Sr. dev is simple. Forethought. It's the act of sitting down and actually figuring out the best way to do something. To plan and research methods before deciding on what's best for the situation. You're no longer just building software. Now you're architecting a solution. You have to see the vision at the end, and then imagine all the little pieces and how they'll work together. As a Sr. developer you won't spend as much time writing code as you will planning code. You'll spend a heck of a lot of time planning. And you'll love it because you know it's the right way to do it.

Just the other day I was speaking to another developer about some new features they were planning on adding to their site. Turns out the software was designed with the eventuality of adding these features. Actually implementing them required a handful of changes and some testing time as opposed to the weeks it was originally estimated that it would take to augment the current application with the new features from scratch.

That's the Sr. Developer's best asset. Forethought.

See you're not paying a Sr. Developer for how many lines of code they're writing. You're paying them for all the lines of code that no one needs to write.


API Logging with Graylog2 - Server Setup

This is a two-part piece. Normally, I try and stay away from these, but the setup process can be a little long. The second piece will go live tomorrow and will contain information about how to interact with the system that you set up.

I recently hit a rather interesting problem and I thought I'd share some of the research around the matter. I was tasked with a simple feature:

Add additional metrics to an API logging mechanism

While it sounds simple enough, the additional metrics would cause our current table structure (we were logging to MySQL) some issues: tons of new rows would make our API logging table grow quite extensively, we worried, to the point of unusability. So, as often happens when a developer is tasked with a "simple" task, the quick feature billowed into a rather monstrous one.

How do we overhaul our current logging infrastructure so that we can:
1. Add the new metrics that we want to track
2. Don't ever run into this issue again

After a bit of thought, we decided to spend the time working on our API logging system with the idea that we could roll out more logging to the rest of the site. Instead of just cobbling something together using our MySQL instance, we decided to move our logging infrastructure to something a little more robust - namely Graylog2.

Before we go any further I just want to point out that this tutorial is NOT for getting a production ready variant of Graylog2 running. There is a LOT more configuration that should go into getting this running in a production environment. This is exclusively a local testing environment to see what all the hubbub with Graylog2 is about.

Graylog2

If you've never heard of Graylog2, you'd better be prepared to have your socks knocked off.

Installation

The installation for Graylog2 is a little more involved if you've only ever set up lamp/lemp stacks. There are config files that need configuring, applications that need installing, and even a bit of finger-crossing. To make things a little easier for myself, I've opted to do the full Graylog2 install in a Vagrant VM. If you've never used Vagrant before, I recommend you check it out - it's very easy to get started with.

Step 1: Oracle JRE

If you get to this point and you're going to argue with me over the Oracle JRE vs OpenJDK, then just use what you'd like. To be honest I couldn't care less, and I don't think either of the following apps will either. The only reason I'm installing Oracle Java is that out of the two I'd rather go with the official Java for something like this.

This is super easy thanks to the webupd8team's ppa.

sudo add-apt-repository ppa:webupd8team/java  
sudo apt-get update  
sudo apt-get install oracle-java7-installer  

If you get an error saying that "add-apt-repository" is not a valid command, just install the python-software-properties and software-properties-common packages. Then go back and add the repo.

sudo apt-get install python-software-properties software-properties-common  

The Java installer will take a bit of time to get through, and during it you'll have to accept a couple license agreements to continue.

Step 2: Grab MongoDB

Installing MongoDB is pretty simple, especially considering the Mongo team gives us an official repo. They have full instructions for this at this url: http://docs.mongodb.org/manual/tutorial/install-mongodb-on-debian/

Step 3: Elasticsearch

Once you have Java set up, you'll want to download the right version of Elasticsearch. Graylog2 relies on Elasticsearch but it requires a very specific version of it. If you have the wrong version, nothing will work and you'll end up with errors about Graylog2 not being able to read the entire message. It's a confusing error but it has to do with the way Graylog2 and Elasticsearch exchange information.

For this little walkthrough, I'm setting up the latest stable version of the Graylog2 server, version 0.20.1. This requires version 0.90.10 of Elasticsearch, so go ahead and do the following:

wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-0.90.10.tar.gz  
tar zxf elasticsearch-0.90.10.tar.gz  
cd elasticsearch-0.90.10  

Now you'll want to edit the Elasticsearch config file located here: config/elasticsearch.yml.

You'll want to configure it as follows:

cluster.name: elasticsearch  
transport.tcp.port: 9300  
http.port: 9200  

These are just the default settings; I've simply uncommented them in the configuration file by deleting the #.

Once that's done you can start up Elasticsearch by running:

./bin/elasticsearch -f

Step 4: Graylog2-server

Next you'll need to grab the Graylog2-server component. This is what actually handles the logging mechanism. Below, we'll download the version of the server we want, extract the tar.gz and move into that directory. Then we'll simply copy the Graylog configuration file to its appropriate place and pop in to edit it.

wget https://github.com/Graylog2/graylog2-server/releases/download/0.20.1/graylog2-server-0.20.1.tgz  
tar zxf graylog2-server-0.20.1.tgz  
cd graylog2-server-0.20.1  
sudo cp graylog2.conf.example /etc/graylog2.conf  
sudo vim /etc/graylog2.conf  

All we really want to do here is configure the password_secret, which is a salt that we'll be using to hash passwords. The configuration file has an example setup you can use to generate a salt.

Next we want to set the root_password_sha2 hash. This again is pretty straightforward, and the configuration file shows you how to do this in the comments right above where you'd be setting these values.
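
If you'd rather not dig through the comments, the gist is something like the following (quoted from memory of what the config file suggests; pwgen may need installing first):

# generate a value for password_secret
pwgen -N 1 -s 96

# generate the root_password_sha2 value from the password you want to use
echo -n yourpassword | shasum -a 256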

Next you'll want to scroll down to line 70 and uncomment elasticsearch_cluster_name, ensuring that it is set to elasticsearch (or something else if you changed the name of the cluster in the elasticsearch.yml file).

With that done, it's time to start up the Graylog2 server! Graylog recommends that you start it as follows the first time:

sudo java -jar graylog2-server.jar --debug  

Eventually you'll see a line saying that the server has started, after which nothing much else will happen. At this point you can kill it and use the start script: ./bin/graylog2ctl start

Step 5: Graylog2-webserver

Finally, there's the pretty sweet Graylog2 web interface. This lets us configure the inputs on the server and actually SEE the data that we're logging.

wget https://github.com/Graylog2/graylog2-web-interface/releases/download/0.20.1/graylog2-web-interface-0.20.1.tgz  
tar zxf graylog2-web-interface-0.20.1.tgz  
cd graylog2-web-interface-0.20.1  
vim conf/graylog2-web-interface.conf  

At this point, we've been working entirely from our local server with a single instance of Elasticsearch running, so graylog2-server.uris should be set to http://127.0.0.1:12900.

Then you just need to set an application.secret, which is just a random salt that will be used for crypto functions.

Now you just need to start the server!

./bin/graylog2-web-interface

If all goes well, you should see a "Listening for HTTP" message along with a port, which is where the web server is running. Now just point your browser over to http://localhost:9000 and log in with the username admin and the password you configured during your Graylog2 server setup.

Take a look around the interface. Under System/Inputs (http://localhost:9000/system/inputs) you'll want to add a new GELF/UDP Input. You can leave the defaults as they are, and just give it a name so you know what it is.

With that done, you are now ready to start logging to your Graylog2 server with their GELF protocol.

In the next part, I'll cover how you can start logging to your Graylog2 instance from PHP.


Proxies 101

At its core, a proxy is a service that is designed to act as a "middle man".
That is, if there are two parties (Website, You) that are trying to communicate
neither of you talk to each other directly. Instead, you talk to a "proxy",
which then relays your information to the website. In return, the website
talks to the proxy, which then relays the information back to you.

Understanding Proxies

A basic conversation between YOU and MY WEBSITE if you connect through a proxy
might look something like this:

  1. You tell the proxy: "connect to http://xangelo.ca"
  2. The proxy connects to http://xangelo.ca and requests the home page
  3. I give the proxy the home page
  4. The proxy gives you the home page.

Pretty simple right?

Of course, the implications of this are enormous! I'll touch on this more later
on as there are a LOT of benefits of running behind a proxy.

Of course, not all proxies are made equal, and there are many different reasons
to use one.

Types of proxies

Proxies can be broken down depending on what layer of the OSI model
they function on. Each has its benefits and drawbacks, and depending on what
you are using a proxy for, one might be better than the other.

Layer 4 Proxies

If you're not familiar with Layer 4 of the OSI model, it's the "Transport"
layer. At this layer, the proxy doesn't see a "URL", but instead it sees IP
addresses.

Proxies are faster if they function at this stage, because as soon as they know
the IP address of what you're trying to get to, they can forward the traffic
there.

Layer 7 Proxies

This layer is the "Application" layer. At this layer, the proxy can see the
specific URL you are trying to access and even the content that is flowing
back and forth.

Forward and Reverse Proxies

In addition, you may hear about "Reverse" and "Forward" proxies. The thing is,
apart from a bit of technical jargon, these proxies function almost exactly
the same way. A forward proxy "forwards" your requests to the internet. A
"reverse" proxy forwards requests FROM the internet to a series of recipients.

Generally forward proxies have a single user that is making the request, whereas
a reverse proxy has multiple recipients that will handle the request from the
internet.

The benefits of a proxy

Personal Anonymity

By placing a proxy between you and a website, the website won't know that
YOU are accessing it. It will think the proxy is. A common use for this scenario
is to have a proxy in one country that you access. Any websites you visit
through the proxy will think you are coming from Country A, when you actually
reside in Country B.

This idea is the basis of technologies like TOR. Instead of connecting directly
to a website, you connect through what is essentially a series of proxies
and the website you are attempting to connect to will only see the details of
the LAST proxy. Of course, your traffic is still traceable through the proxy
chain, but every hop makes it a LOT harder.

Load Balancing

Load balancing is exactly what it sounds like. Imagine you're given a 300 pound
barbell. Instead of holding it in one hand, you use both your hands. This way,
you're not putting unnecessary stress on one hand - you're balancing the load
across both hands.

When it comes to web services, we "Load balance" to direct traffic between
multiple servers so that no one server is trying to handle everything. This allows
us to have smaller servers, but also to have redundancy. If one server is not
functional, we can still serve our users because we have another one.

If you have a fairly simple web app (the entirety can live on a single server),
you can run multiple copies of it behind a proxy and use a Layer 4 Reverse
Proxy to balance traffic between them.

If you have a complicated web application, you can have a single URL, and then
using a Layer 7 Reverse Proxy you can route users to different components
of your application. In a "Service Oriented Architecture" system, a Layer 7
Reverse Proxy is pretty essential.

Security

Today, everyone runs an anti-virus to protect themselves from malware and
viruses. However, when you run a company, buying anti-virus and managing it for
hundreds of users is a LOT of cost. It would be better if you could do it all
in one place.

By routing all requests through our proxy server, we can take the content that
websites send back and run it through virus detection software, ensuring that
employees aren't getting viruses and malware sent back to them.

Proxies are, in addition, what power the black/white listing functionality
that you find at various companies. With a blacklist, you can enter a URL or
IP address (Layer 7 or 4 respectively) and users that attempt to access those
websites are blocked.

Data Leakage Prevention

In addition to stopping malware and viruses from getting in to a company, a
proxy can stop data from leaving a company as well. In large corporations,
data theft is a big issue, and being able to detect data loss is critical. By
passing data leaving the company through a DLP solution (part of which includes
a proxy), you can stop data leakage before it happens.