Autonomously crawling through DICE job postings

I am currently working on a book that requires me to search through thousands of job advertisements. For the last couple of days I have been looking at various websites, collecting data and looking for patterns in employment listings. Even if you are not working on a book, I am sure that at some point you will be looking for a new job online. I love searching for jobs, and if you don’t love it too… you are probably doing it wrong.
First of all, don’t search for jobs manually! It is a waste of time and it will drive you insane. Instead, use a scripting language such as Perl to mine the job sites and pull out the best matches. In fact, I wrote a post a few days ago about mining employment postings on craigslist. If you are new to data mining, I recommend the book “Mining the Social Web” by Russell… Nice chap… Met him at Harvard Square a couple of years ago.



Here I outline the steps I took to extract all job postings from DICE. First of all, you have to know how the site exposes what is stored in its database. Run any search from the initial screen (e.g. “embedded”).

"Embedded" search on dice.com

This particular search generated the following very-long URL… so long that I had to include spaces:

http://seeker.dice.com/jobsearch/servlet/JobSearch?op=300&N=0&Hf=0&NUM_PER_PAGE=30&Ntk=JobSearchRanking&Ntx=mode+matchall &AREA_CODES=&AC_COUNTRY=1525&QUICK=1&ZIPCODE=&RADIUS=64.37376 &ZC_COUNTRY=0&COUNTRY=1525&STAT_PROV=0&METRO_AREA=33.78715899%2C-84.39164034&TRAVEL=0&TAXTERM=0&SORTSPEC=0&FRMT=0 &DAYSBACK=30&LOCATION_OPTION=2&FREE_TEXT=embedded&WHERE=
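Long as it is, this URL is an ordinary query string, so a single parameter such as NUM_PER_PAGE can be changed programmatically instead of by hand. Here is a minimal Python sketch of that idea. The function name set_query_param and the shortened example URL are my own illustration, and whether the site accepts any particular value is an assumption.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def set_query_param(url, name, value):
    """Return `url` with the query parameter `name` set to `value`."""
    parts = urlsplit(url)
    # keep_blank_values=True preserves empty fields such as AREA_CODES=
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    new_pairs = [(k, str(value)) if k == name else (k, v) for k, v in pairs]
    # Append the parameter if it was not in the URL at all
    if not any(k == name for k, _ in pairs):
        new_pairs.append((name, str(value)))
    return urlunsplit(parts._replace(query=urlencode(new_pairs)))


# Shortened, illustrative version of the search URL (not the full one)
url = ("http://seeker.dice.com/jobsearch/servlet/JobSearch"
       "?op=300&NUM_PER_PAGE=30&AREA_CODES=&FREE_TEXT=embedded")
print(set_query_param(url, "NUM_PER_PAGE", 1689))
```

Parsing and re-encoding the query keeps the other parameters in their original order, which makes it easy to diff the new URL against the old one.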

Since this particular search returned 1689 job postings, we just have to change NUM_PER_PAGE=30 to NUM_PER_PAGE=1689 in order to see every job posting on a single page. Save that page to your hard disk in HTML format. For completeness, here is the file with all 1689 postings I just downloaded. The following Perl script parses the contents of this file and looks for the URL associated with each job posting.
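For readers who prefer Python, the same idea can be sketched as follows. This is not the dice.pl script, only a rough illustration of pulling links out of a saved results page. The marker string "/jobsearch/" used to recognise a job link is an assumption on my part, and the sample HTML is made up.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags whose URL contains a marker string."""

    def __init__(self, marker):
        super().__init__()
        self.marker = marker
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Keep each matching link once, in the order it appears on the page
        if href and self.marker in href and href not in self.links:
            self.links.append(href)


def extract_job_urls(html, marker="/jobsearch/"):
    """Return the de-duplicated job links found in an HTML string."""
    parser = LinkCollector(marker)  # marker is an assumed pattern
    parser.feed(html)
    return parser.links


# Made-up sample standing in for the saved results page
sample = """
<a href="/jobsearch/example-posting-1">Embedded Engineer</a>
<a href="/about">About</a>
<a href="/jobsearch/example-posting-1">Embedded Engineer (again)</a>
"""
for url in extract_job_urls(sample):
    print(url)
```

Using a real HTML parser rather than a regular expression means escaped characters in the links (such as &amp;amp;) come out already decoded.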

Save the file (e.g. dice.pl) and execute it with the following command:

perl dice.pl embedded_Jobs_at_Dice.html > embedded_url.txt

This will store every single URL, one per line, in the file embedded_url.txt. Once again… for completeness, here is my generated file.

The next step is to download every single job posting into a separate file. Since I am a Mac OS X user, I need to download the contents of each URL from the OS X command line. This is easily accomplished with the following bash script:

In the same directory as the output of the previous Perl script (e.g. embedded_url.txt), save this bash script (e.g. download_all_jobs.sh) and execute it with the following commands:

chmod u+x download_all_jobs.sh
./download_all_jobs.sh embedded_url.txt
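The same download loop can also be sketched in Python. This is not the download_all_jobs.sh script, just a minimal illustration under a few assumptions of mine: the job_00001.html naming scheme, the one-second pause between requests and the function names are all hypothetical.

```python
import time
import urllib.request
from pathlib import Path


def fetch(url, timeout=30):
    """Download one URL and return the raw response body as bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def download_all(url_file, out_dir, fetcher=fetch, delay=1.0):
    """Save every URL listed in url_file (one per line) to its own file."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    lines = Path(url_file).read_text().splitlines()
    urls = [line.strip() for line in lines if line.strip()]
    saved = []
    for index, url in enumerate(urls, start=1):
        target = out / f"job_{index:05d}.html"  # hypothetical naming scheme
        if not target.exists():                  # skip files from earlier runs
            target.write_bytes(fetcher(url))
            time.sleep(delay)                    # pause between requests
        saved.append(target)
    return saved


# Example call (needs network access, so it is left commented out):
# download_all("embedded_url.txt", "dice_jobs")
```

Skipping files that already exist lets an interrupted run pick up where it stopped instead of fetching everything again.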

The compressed outcome of this last step is a 27 MB file.

Now that I have all this data, I need to extract the skills required for each advertised position. So, I placed all the compressed files in the sub-directory dice_jobs and ran the following script:

The skill extraction itself is done by the following Perl script (extract_data.pl).
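To give a feel for what such an extraction involves, here is a minimal Python sketch. It is not extract_data.pl: the skill list, the crude tag stripping and the matching rule are all assumptions of mine. The one subtle point is that a plain word boundary does not work for names like C++, so the pattern uses explicit lookarounds instead.

```python
import re

# Illustrative skill list; not the list used in the original script
SKILLS = ["C", "C++", "Linux", "Java", "MySQL", "Python"]


def skill_pattern(skill):
    """Match `skill` only when it is not part of a longer token."""
    # \b fails after "+", so check the neighbouring characters by hand
    return re.compile(
        r"(?<![A-Za-z0-9+#])" + re.escape(skill) + r"(?![A-Za-z0-9+#])",
        re.IGNORECASE,
    )


PATTERNS = {skill: skill_pattern(skill) for skill in SKILLS}


def strip_tags(html):
    """Crudely replace HTML tags with spaces (good enough for a sketch)."""
    return re.sub(r"<[^>]+>", " ", html)


def find_skills(html):
    """Return the set of known skills mentioned in one job posting."""
    text = strip_tags(html)
    return {skill for skill, pat in PATTERNS.items() if pat.search(text)}


print(sorted(find_skills("<p>Strong C++ and Linux skills; JavaScript a plus.</p>")))
```

With these lookarounds, "C++" does not also count as "C", and "JavaScript" does not count as "Java".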

I then feed the extracted data into a Mathematica script; a (readable) PDF version of the Mathematica script is here, and the source is here. In this script, I combined all the skills found and ignored skills that were required in fewer than 30 distinct advertisements (e.g. COBOL and Pascal). Below is the resulting pie chart.

Most requested skills in embedded computing jobs.
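The counting and filtering behind a chart like this can be sketched in a few lines of Python. This is not the Mathematica script. The function name skill_shares and the toy data are made up, and only the threshold of 30 distinct advertisements comes from the description above.

```python
from collections import Counter


def skill_shares(skills_per_ad, min_ads=30):
    """Return each kept skill's share of all kept skill mentions."""
    counts = Counter()
    for skills in skills_per_ad:
        counts.update(set(skills))  # count a skill once per advertisement
    # Drop skills that appear in fewer than min_ads distinct advertisements
    kept = {skill: n for skill, n in counts.items() if n >= min_ads}
    total = sum(kept.values())
    ranked = sorted(kept.items(), key=lambda item: -item[1])
    return {skill: n / total for skill, n in ranked}


# Toy data: 40 ads asking for C and Linux, 10 asking for C and COBOL
ads = [{"C", "Linux"}] * 40 + [{"C", "COBOL"}] * 10
for skill, share in skill_shares(ads).items():
    print(f"{skill}: {share:.1%}")
```

Counting each skill once per advertisement matches the "distinct advertisements" wording, so a posting that repeats a keyword ten times carries no extra weight.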

As expected, the most sought-after skills in “embedded computing” jobs are C, C++ and Linux. Java, MySQL and kernel development are also in strong demand these days. Surprisingly, I saw lots of requests for mobile computing and networking skills. However, the most surprising requested skill is databases (MySQL)!

Finally, I am aware that I could have done everything in this post in a single Perl script. However, writing a post about a single script would get tedious very quickly. I also wanted to save the outcome of every step on my hard disk so I could run additional tests on the data without having to connect to dice.com each time.