Saturday, December 14, 2013

My second choropleth map


In my last blog entry I described my first attempt at coloring areas in a map based on values associated with each area. In that case it was the number of people living in each province of the Netherlands. I thought of expanding on that by doing the same for the municipalities in the Netherlands. Sounds like the same routine, right? Just with a few more numbers. Well, you've guessed it: it wasn't.

The easy part

First I looked at all the available maps on Wikipedia. There's a lot to choose from, so I downloaded every empty map just to be sure. The problem is that not all maps are scalable vector graphics. The most recent maps are, and they have the nice feature that the name and area code are included in the paths that describe the boundaries. For example, the first one is Appingedam with area code GM0003. Since these are official names and codes, I was feeling confident that I would manage to connect the right paths to the data I wanted to use.

A bit harder

I had decided to color the area of every municipality based on the number of inhabitants. I guessed that this data would be really easy to find. And sure, the most recent numbers can be found on the internet quickly enough. But I was thinking of comparing different years, and finding that data needed some digging in the CBS files. I found out that the data was available, but not in handy spreadsheets for me to download. What I did download was a pdf containing demographic information per municipality. There's a file like this for every year dating back to 2000, so that seemed like a nice start.
In this file the information is grouped per province. So I took the time to copy-paste the pages I needed into a plain text file to use later on. The first line in that file looks like this:

0003 Appingedam 12 114 2 598 1 170 1 463 1 813 2 663 1 682 725 19,9

The first number is the area code as mentioned above (hurray), the second item is the name (yeah), and what follows is lots of numbers. In the original file a space is used to divide groups of three digits to improve readability. The first two numbers after the name therefore together make up the total number of inhabitants, in this case 12114.

Regular expressions

Since I felt this should be a learning experience I decided this would be a great opportunity to get my hands dirty with some regular expression magic in R. I finally came up with the following expression that works in this particular case:

regexp <- digit:="" i="">

Short explanation: [:digit:] makes it look for a digit and the braces {} give the number of digits. When it says {1,3}, it's happy if there are 1 to 3 digits. Somewhere in the middle there's a (.+) that just matches anything. For each line of the data file, R tries to extract a piece that fits the description in the regular expression and puts that in a new list. The first element would be:
0003 Appingedam 12 114
Luckily this worked for the entire list I had with one exception. I assumed beforehand that every municipality had at least 1,000 inhabitants but it turned out that Schiermonnikoog only has 948. That mistake was easily found in the end when I was wondering why a small island, which in my opinion was pretty peaceful, seemed to have at least 800,000 inhabitants.
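
To make the idea concrete, here is a minimal sketch of the same kind of extraction, written in Python rather than R. It is not the expression I used above: the pattern and the names PREFIX_RE and extract_prefix are assumptions made up for this illustration.

```python
import re

# Assumed pattern (not the original R expression):
#   a 4-digit area code, a name (anything), then two groups of digits,
#   1-3 digits and exactly 3 digits, that together form the population.
PREFIX_RE = re.compile(r"^\d{4} .+? \d{1,3} \d{3}\b")

def extract_prefix(line):
    """Return the 'code name population' part of a line, or None if no match."""
    match = PREFIX_RE.match(line)
    return match.group(0) if match else None

line = "0003 Appingedam 12 114 2 598 1 170 1 463 1 813 2 663 1 682 725 19,9"
print(extract_prefix(line))  # 0003 Appingedam 12 114
```

A pattern like this always expects two groups of digits. A municipality with fewer than 1,000 inhabitants has only one group, so the pattern would swallow the first number of the next column as well. That is the kind of mistake described above.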
Breaking apart the string that remained wasn't hard after that. I just had to take into account that some names consist of more than one word, e.g. Capelle aan den IJssel. The easiest solution was just taking off the last two elements in the string, since those are the two numbers that together make up the number of inhabitants.
Finally I exported the now nice-looking data to a csv file, and I could start thinking about using it to color the map.
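
The splitting step could look roughly like the sketch below. It is in Python instead of R, and the function name split_record and the csv column names are assumptions for illustration only. Splitting from the right keeps names with several words in one piece.

```python
import csv
import io

def split_record(prefix):
    """Split 'code name thousands units' into (code, name, inhabitants)."""
    # Take the last two elements off from the right, so that names made up
    # of several words stay intact.
    head, thousands, units = prefix.rsplit(" ", 2)
    code, name = head.split(" ", 1)
    # "12" and "114" are glued together to form 12114.
    return code, name, int(thousands + units)

rows = [split_record("0003 Appingedam 12 114")]

# Written to an in-memory buffer here; a real run would open a file instead.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["code", "name", "inhabitants"])  # assumed column names
writer.writerows(rows)
print(buffer.getvalue())
```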


Throughout this process I was following the example by Nathan Yau on making a choropleth map of U.S. counties. He was using Python with the BeautifulSoup library. Now that my map is finished I must say I like it a lot. But times were different. Since this was my first real Python experience, it took a lot of time to understand all the small things that can make a Python script crash and burn (like a misplaced space).
In the end I had my data in a dictionary ready to be accessed when a certain path needed to be colored and also to find those area codes inside the svg map when needed. The end result (converted to png) looks like this.
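
The coloring step can be sketched as follows. To keep the example self-contained it uses Python's built-in xml.etree.ElementTree instead of BeautifulSoup, so it is not the script I actually ran. It also assumes that the area code sits in the id attribute of each path, and the tiny svg, the code GM9999, the thresholds and the colors are all made up for illustration.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
ET.register_namespace("", SVG_NS)  # keep plain <svg>/<path> tags on output

# Tiny stand-in for the real map. Assumption: the code is in the id attribute.
SVG_TEXT = (
    '<svg xmlns="http://www.w3.org/2000/svg">'
    '<path id="GM0003" d="M0 0 L10 0 L10 10 Z" style="fill:#cccccc"/>'
    '<path id="GM9999" d="M10 0 L20 0 L20 10 Z" style="fill:#cccccc"/>'
    '</svg>'
)

# Hypothetical lower bounds and fill colors, highest bin first.
BINS = [(100000, "#08519c"), (50000, "#3182bd"), (20000, "#6baed6"), (0, "#c6dbef")]

def color_for(inhabitants):
    """Pick the color of the first bin whose lower bound is reached."""
    for lower, color in BINS:
        if inhabitants >= lower:
            return color
    return BINS[-1][1]

def color_map(svg_text, population):
    """Set the fill of every path whose code appears in the dictionary."""
    root = ET.fromstring(svg_text)
    for path in root.iter("{%s}path" % SVG_NS):
        code = path.get("id", "")
        if code in population:
            path.set("style", "fill:" + color_for(population[code]))
    return ET.tostring(root, encoding="unicode")

# The data file has 0003, the map has GM0003, so the keys get the GM prefix.
population = {"GM0003": 12114}
colored = color_map(SVG_TEXT, population)
print(colored)
```

Paths without a matching code keep their original gray fill.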

There are some problems with this map. There's no legend or title for starters, but there are also some gray areas. Municipalities merge sometimes, and I had used the data from 2009 on a map of 2012. So that was just a dumb mistake. The third problem is the colors. I hadn't noticed the gray areas at first glance since I'm colorblind. The gray areas somehow just seem to blend in with the other colors, so in the future I really want to find a way to make a clearer distinction between colored and uncolored areas.
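
A check like the one below would have caught the gray areas without relying on my eyes. It is only a sketch: the function name unmatched_codes is hypothetical, and GM9998 and GM9999 are made-up placeholder codes. The idea is simply to compare the codes in the map with the codes in the data before coloring anything.

```python
def unmatched_codes(map_codes, data_codes):
    """Report codes that appear in only one of the two sources."""
    map_codes, data_codes = set(map_codes), set(data_codes)
    return {
        "map_only": sorted(map_codes - data_codes),   # would stay gray
        "data_only": sorted(data_codes - map_codes),  # data with no area to color
    }

report = unmatched_codes(["GM0003", "GM9998"], ["GM0003", "GM9999"])
print(report)
```

When the map and the data come from different years, both lists can be non-empty because of merged municipalities.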

Touching up

Just to make the final image look nicer I added a legend and a title in Inkscape before I did a run of the Python script. I'm pretty pleased with the end result as it is, even with the aforementioned problems. Check it out below.
What I really hope to look into is the following. The svg file that at first had plain gray areas now has colored areas, so the file itself is altered. If you would like to change the image, you have to run the Python script again. If you're thinking of using something like this on a website with interactivity, that doesn't sound like it would work properly. You would want an svg image where the fill of a path can be changed whenever you like. So, back to the drawing board.
