Automated craigslist job search with Perl and Bash
Posted: March 30, 2012 Filed under: Operating Systems, Programming Languages | Tags: Bash, Linux, Mac OS X, Perl
Most of my students are in the job market, and after suggesting websites where they could look for jobs, I took a peek at craigslist. I like craigslist; it's a simple, bare-bones website with pure text. However, the search functionality is a bit awkward, and it is hard to find a good match between a candidate's skills and a particular job posting. Even if you only look at every single “filtered” post, it may still take you over an hour to find the best skill-to-job matches. So I created two scripts, one in Perl and the other in Bash, that scan all the job postings for skill matches and create a new web page listing the matching positions and matched skills as ready-to-click links. Data mining at its best!
There are some “limitations” to these scripts. First of all, they were only tested on Mac OS X and Linux; however, I am sure you can convert them quite easily to Windows. Secondly, I’ve focused all the craigslist searches around New England. You may add other craigslist locations quite easily by following the instructions in the Perl script.
This automated job search requires two files: search_jobs.sh and craigslist.pl. Both can be found below, or at my github repository. To run the code, place both files in the same directory and edit search_jobs.sh, shown below, with a text editor (after emacs, my second favorite text editor is TextWrangler). In this file, modify the keywords that are assigned to the variable SEARCH_SKILLS. Currently the search skills are the standard qualifications for an engineering graduate.
    #!/bin/bash

    #################################################################################
    #Functions (do not modify anything here)
    #################################################################################
    SEARCH_SKILLS=""

    function search_jobs {
      SEARCH_NAME=${1}
      #strip the commas so that each skill reaches the Perl script as a clean argument
      perl craigslist.pl ${SEARCH_SKILLS//,/ } > raw_data.txt

      #create the header of an html file
      echo "<html><title>Job Search Results</title><body>" > job_data.html

      #sort the entire file contents and make sure the best matches are on top
      sort -t! -n -r -k3 raw_data.txt >> job_data.html

      #clean up the file
      perl -p -i -e "s/!/\ \ \ \ \ \ /g" job_data.html

      #terminate the html file
      echo "</body></html>" >> job_data.html

      mv job_data.html ${SEARCH_NAME}.html
      rm -f raw_data.txt
    }

    #################################################################################
    #You may modify your skills below
    #################################################################################
    SEARCH_SKILLS="embedded, circuit, transistor, VLSI, firmware, RTOS, kernel, MacOSX, JTAG, oscilloscope, HDL, FPGA, Arduino, MSP430, OMAP3540, micro-controllers, microcontrollers, SVN, programmer, Perl, linux, Mathematica, LabVIEW, schematics, Verilog, VHDL"
    SEARCH_NAME="engineering"
    search_jobs ${SEARCH_NAME}

    SEARCH_SKILLS="quantitative, mathematica, finance, programmer, developer, high-frequency, fpga, microcontroller"
    SEARCH_NAME="finance"
    search_jobs ${SEARCH_NAME}
This script runs with the following command line:
./search_jobs.sh
When it finishes, it creates two files, engineering.html and finance.html, where the candidate can see the best job matches.
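To see what the sort step inside search_jobs does, here is a small sketch that runs the same sort options on three made-up records in the link!date!count!skills layout that craigslist.pl prints. The URLs, dates, and counts are invented for illustration only.

```shell
#!/bin/bash
# Made-up records in the "link!date!count!skills" layout (illustration only)
demo_dir=$(mktemp -d)
cat > "${demo_dir}/raw_data.txt" <<'EOF'
<li><a href="http://example.org/1.html">site #0</a>!2012-03-29!2!Perl linux
<li><a href="http://example.org/2.html">site #1</a>!2012-03-30!5!FPGA VHDL Verilog HDL JTAG
<li><a href="http://example.org/3.html">site #2</a>!2012-03-30!1!firmware
EOF

# -t '!' : fields are separated by "!"
# -k3    : the sort key starts at the third field (the number of matched skills)
# -n -r  : compare numerically, largest first
sorted=$(sort -t '!' -n -r -k3 "${demo_dir}/raw_data.txt")
echo "${sorted}"

# The first line of the result is the posting with the most matched skills
best=$(echo "${sorted}" | head -n 1 | cut -d'!' -f3)
echo "best match count: ${best}"
rm -rf "${demo_dir}"
```

The posting with five matched skills ends up on the first line, which is why the best matches appear at the top of the generated page.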
Below is the Perl script that parses the craigslist job postings.
    #This script fetches the last 2 days new job postings from craigslist that match
    #a specific criteria and reports the URLs that correspond to that match.
    #The search criteria comes from the input arguments. The craigslist sites
    #are hardwired to the New England area. You may change them by manually
    #altering the variables in Section #3.
    #
    #Version 0.2 30/march/2012
    #Author: Nuno Alves
    #

    #############################################################################
    #Section #1 - load libraries
    #############################################################################
    use strict;
    use POSIX;
    use LWP::Simple;

    #############################################################################
    #Section #2 - input arguments are your skillsets
    #############################################################################
    my $num_args = $#ARGV + 1;
    #at least one skill is required
    if ($num_args == 0)
    {
        print "You must add some skills as arguments\n";
        exit;
    }

    #############################################################################
    #Section #3 - defining variables
    #############################################################################
    #what craigslist sites
    my @search_site=("http://boston.craigslist.org","http://nh.craigslist.org","http://maine.craigslist.org","http://burlington.craigslist.org","http://westernmass.craigslist.org","http://worcester.craigslist.org");

    #type what positions you are looking for (egr = engineering, sof = software)
    my @positions=("egr","sof","bus","acc");

    #this array contains the arguments which are your resume skills
    my @skills=@ARGV ;

    #############################################################################
    #Section #4 - debug code
    #############################################################################
    #instead of working on every single URL, setting $debug=1 will just scan
    #the debug webpages listed below
    my $debug=0;
    my @debug_urls=("http://boston.craigslist.org/gbs/egr/2902012136.html","http://boston.craigslist.org/bmw/egr/2929181526.html","http://boston.craigslist.org/gbs/egr/2926742528.html");

    #############################################################################
    #Section #5 - subroutines for collecting craigslist data
    #############################################################################
    sub collect_job_posting_http
    {
        my $url=$_[0];
        my $content = get $url;
        #print $content . "\n";

        my @splitcontents=split(/<h4 class=\"ban\"/,$content);
        my $size_splitcontents=@splitcontents;
        my @url_data=();

        for (my $i=1 ; $i<$size_splitcontents ; $i++)
        {
            #just want the last 2 days of postings
            if ($i<3)
            {
                #print "============\n\n\n";
                #print $splitcontents[$i] . "\n";

                #get all the posting urls for this particular day
                my @postingdata=split(/<p><a href=\"|\">/,$splitcontents[$i]);
                for (my $j=0; $j<@postingdata ; $j++)
                {
                    #print ">>[$j]>>" . $postingdata[$j] . "<<<\n";
                    if ($postingdata[$j]=~m/^http/)
                    {
                        push(@url_data,$postingdata[$j]);
                    }
                }
            }
        }
        return(@url_data);
    }

    sub extract_date
    {
        my @url_data=$_[0];
        my @date_data=split(/Date: 2012-|EDT<br>/,$url_data[0]);
        return("2012-" . $date_data[1]);
    }

    #############################################################################
    #Section #6 - main program: collecting http data for each job posting
    #############################################################################
    my @urls=();
    if ($debug == 0)
    {
        for (my $k=0;$k<@search_site;$k++)
        {
            for (my $z=0;$z<@positions;$z++)
            {
                my $base_url=$search_site[$k]."/".$positions[$z];
                my @tmp_data=collect_job_posting_http $base_url;
                push(@urls,@tmp_data);
            }
        }
    }
    else
    {
        @urls=@debug_urls;
    }

    #foreach (@urls)
    #{
    #    print $_ . "\n";
    #}

    #############################################################################
    #Section #7 - check if each posting matches at least one skill
    #############################################################################
    my @matched_skills=();
    my @skill_type=();
    my @post_date=();

    for (my $i=0 ; $i<@urls ; $i++)
    {
        my $url=$urls[$i];
        my $content = get $url;
        my $counter=0;
        my $date;

        # print $url . "\n";
        # print $content . "\n";

        my $skill_type_desc="";
        for (my $k=0; $k<@skills ; $k++)
        {
            if ($content =~ m/$skills[$k]/i)
            {
                $counter++;
                $skill_type_desc = $skill_type_desc . $skills[$k] . " ";
            }
        }
        push(@matched_skills,$counter);
        push(@skill_type,$skill_type_desc);
        push(@post_date,extract_date($content));
    }

    #############################################################################
    #Section #8 - print results to the screen
    #############################################################################
    for (my $i=0; $i < @matched_skills ; $i++)
    {
        if ($matched_skills[$i]>0)
        {
            print "<li><a href=\"$urls[$i]\">site #$i\<\/a\>" . "!" . $post_date[$i] . "!" . $matched_skills[$i] . "!" . $skill_type[$i] . "\n";
        }
    }
File related Linux bash snippets
Posted: March 9, 2012 Filed under: Operating Systems, Programming Languages | Tags: Bash, Linux
Here are some extremely useful Linux bash snippets I use all the time to parse experimental data from my simulations.
How to extract the top (insert number here) lines from a file
Consider a file named test-file.txt. You can extract the top 14 lines from that file using the following:
    head -n 14 test-file.txt
How to extract specific lines in a file using regular expressions
Consider a file, named test-file.txt, with the following lines:
_N37_:0:_N262_:1:_N696_:0
_N37_:0:_N233_:0
_N37_:0:_N263_:0:_N694_:0
_N37_:1:_N113_:0
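To extract the lines that contain at least five “:” separators (the first and third lines above), we can type the following. The pattern is quoted so that the shell does not try to expand the * characters:
    grep ':.*.:.*.:.*.:.*.:' test-file.txt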
To extract the other lines, we can simply negate that regular expression:
    grep -v ':.*.:.*.:.*.:.*.:' test-file.txt
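If what you need is an exact field count rather than "at least five colons", awk can split each line on ":" and test the number of fields directly. This is an alternative sketch, not part of the original snippet; it recreates the sample file in a temporary directory.

```shell
#!/bin/bash
demo_dir=$(mktemp -d)
cat > "${demo_dir}/test-file.txt" <<'EOF'
_N37_:0:_N262_:1:_N696_:0
_N37_:0:_N233_:0
_N37_:0:_N263_:0:_N694_:0
_N37_:1:_N113_:0
EOF

# -F: splits each line on ":"; NF is the number of resulting fields
six_fields=$(awk -F: 'NF == 6' "${demo_dir}/test-file.txt")
four_fields=$(awk -F: 'NF == 4' "${demo_dir}/test-file.txt")

echo "${six_fields}"
echo "---"
echo "${four_fields}"
rm -rf "${demo_dir}"
```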
How to search and replace text in a file
    perl -p -i -e "s/string1/string2/g" file.txt
Where string1 is what you are searching for, string2 is what you want it to be replaced with and file.txt is the file you want to perform this operation on.
How to print a specific line number inside a file using a variable
    LINENBR=3
    sed -n "${LINENBR}p" filename.txt
Where $LINENBR is the number of the line you want to print.
How to append 2 files, column by column, keeping particular columns
    paste -d"_" file1.txt file2.txt | cut -d"_" -f1,4 > file3.txt

This joins file1.txt and file2.txt line by line, separating them with the delimiter "_". It then extracts columns 1 and 4, delimited by "_", and writes them to file3.txt.
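A small worked example may make the column numbering clearer. The file contents here are made up: file1.txt has three "_"-separated columns, so column 4 of the pasted line is the first column of file2.txt.

```shell
#!/bin/bash
demo_dir=$(mktemp -d)
# file1.txt has three "_"-separated columns, file2.txt has two (made-up data)
printf 'a1_b1_c1\na2_b2_c2\n' > "${demo_dir}/file1.txt"
printf 'x1_y1\nx2_y2\n' > "${demo_dir}/file2.txt"

# After paste each line looks like a1_b1_c1_x1_y1, so column 4 is the
# first column of file2.txt
paste -d"_" "${demo_dir}/file1.txt" "${demo_dir}/file2.txt" | cut -d"_" -f1,4 > "${demo_dir}/file3.txt"

result=$(cat "${demo_dir}/file3.txt")
echo "${result}"
rm -rf "${demo_dir}"
```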
How to transfer files across computers with ssh
The scp command copies files to a remote Linux system.
    scp (file) (user)@(host.domain):(path)
To copy files from a remote system to your local system:
    scp (user)@(host.domain):(path) (destination)
How to remove a file extension
    FILENAME="hello.cpp"
    FNOEXTENSION=`echo ${FILENAME} | cut -d "." -f1`
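The cut approach keeps everything before the first dot. As an alternative sketch, shell parameter expansion can strip only the last extension without starting another process; the variable names ARCHIVE, LAST_REMOVED, and FIRST_FIELD are made up for this comparison.

```shell
#!/bin/bash
FILENAME="hello.cpp"
# ${VAR%.*} removes the shortest trailing match of ".*", i.e. the last extension
FNOEXTENSION=${FILENAME%.*}
echo "${FNOEXTENSION}"

# With several dots the two approaches give different answers
ARCHIVE="backup.tar.gz"
LAST_REMOVED=${ARCHIVE%.*}
FIRST_FIELD=$(echo "${ARCHIVE}" | cut -d "." -f1)
echo "${LAST_REMOVED} ${FIRST_FIELD}"
```

Which one is appropriate depends on whether names such as backup.tar.gz should become backup.tar or backup.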
How to remove redundant lines inside a file
In the terminal type:
    sort (filename) | uniq
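Sorting changes the order of the lines. When the original order matters, a commonly used awk idiom keeps only the first occurrence of each line. The sketch below compares both on a made-up file.

```shell
#!/bin/bash
demo_dir=$(mktemp -d)
printf 'pear\napple\npear\nbanana\napple\n' > "${demo_dir}/fruits.txt"

# sort -u is shorthand for sort | uniq
sorted_unique=$(sort -u "${demo_dir}/fruits.txt")

# seen[$0]++ is 0 (false) the first time a line appears, so !seen[$0]++ is
# true only for first occurrences; awk prints those and keeps their order
ordered_unique=$(awk '!seen[$0]++' "${demo_dir}/fruits.txt")

echo "${sorted_unique}"
echo "---"
echo "${ordered_unique}"
rm -rf "${demo_dir}"
```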
How to delete the last line of a file
In the terminal type:
    sed -n '$!p' (file)

This prints every line except the last one to standard output; the file itself is not modified.
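To actually shorten the file on disk, one portable approach is to write the result to a temporary file and then move it over the original. This is a sketch with a made-up file; it uses the equivalent sed expression '$d'.

```shell
#!/bin/bash
demo_dir=$(mktemp -d)
target="${demo_dir}/data.txt"
printf 'line1\nline2\nline3\n' > "${target}"

# '$d' deletes the last line; every other line is printed unchanged.
# Writing to a temporary file first avoids truncating the input while sed reads it.
sed '$d' "${target}" > "${target}.tmp" && mv "${target}.tmp" "${target}"

remaining=$(cat "${target}")
echo "${remaining}"
rm -rf "${demo_dir}"
```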
How to count the number of lines in a file and write that number into another file
In the terminal type:
    impcount=`wc -l < ${CIRCUIT}-impID.dat`
    echo ${impcount} >> (filename)
Alternatively you can also store the number of lines into a shell variable.
    impcount=`wc -l < ${CIRCUIT}-impID.dat`
    echo ${impcount} > xxx.tmp
    NIMPS=`perl -n -e '@splitline=split(/ /,$_); $splitline[1]=~s/ //g; print $splitline[0] ."\n"; ' xxx.tmp`;
    rm -f xxx.tmp
    echo ${NIMPS}
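The temporary file and the Perl one-liner appear to be there to strip the padding that some wc implementations put around the count. Assuming that is the goal, arithmetic expansion gives the same clean number in one step. The file name sample.dat is made up for the demonstration.

```shell
#!/bin/bash
demo_dir=$(mktemp -d)
printf 'a\nb\nc\n' > "${demo_dir}/sample.dat"

# Some wc implementations pad the count with leading spaces.
# Arithmetic expansion evaluates the text as a number, which drops the padding.
NIMPS=$(( $(wc -l < "${demo_dir}/sample.dat") ))
echo "${NIMPS}"
rm -rf "${demo_dir}"
```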
How to perform the same operation over several files
In a bash script type:
    for file in FAULTY*
    do
      echo "$file"
    done

This prints, via an echo command, every filename that matches the glob pattern FAULTY*. Letting the shell expand the pattern directly, instead of parsing the output of ls, also works for filenames that contain spaces.
How to write a loop inside a bash file
    #!/bin/bash
    RUN=1
    until [ $RUN -eq 10 ]
    do
      echo "the current loop index is ${RUN}"
      RUN=$(( $RUN + 1 ))
    done
How to split a string inside a bash file
    x=a:b:c
    #printing a:b
    echo ${x%:*}
    #printing b:c
    echo ${x#*:}
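The single % and # operators remove the shortest match. As a short sketch of two related techniques: doubling the operator removes the longest match, and read with IFS set to ":" splits the string into all of its fields at once. The variable names first, last, f1, f2, and f3 are made up.

```shell
#!/bin/bash
x=a:b:c

# Doubling the operator removes the longest match instead of the shortest
first=${x%%:*}
last=${x##*:}
echo "${first} ${last}"

# Split into all fields: IFS is set to ":" for the read command only
IFS=: read -r f1 f2 f3 <<EOF
${x}
EOF
echo "${f1} ${f2} ${f3}"
```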
How to do input parameter error testing
The variable ${CIRCUIT} is the first command line argument (${1})
    CIRCUIT=${1}

    if test -z "${CIRCUIT}"; then
      echo ""
      echo "ERROR: First parameter is missing. Add the benchmark name (eg c17.v)."
      echo ""
      exit 1
    fi

    if test ! -e "${CIRCUIT}"; then
      echo ""
      echo "ERROR: Specified benchmark file does not exist."
      echo ""
      exit 1
    fi
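If several scripts need the same checks, they can be wrapped in a function that reports the problem through its return status and lets the caller decide whether to exit. This is a sketch only; check_benchmark is a hypothetical helper name, and the return codes 1 and 2 are an arbitrary choice.

```shell
#!/bin/bash
# check_benchmark is a hypothetical helper name, not from the original scripts.
# It returns 0 when the argument names an existing file, 1 when the argument
# is missing, and 2 when the file does not exist.
check_benchmark() {
  if [ -z "${1}" ]; then
    echo "ERROR: First parameter is missing." >&2
    return 1
  fi
  if [ ! -e "${1}" ]; then
    echo "ERROR: Specified benchmark file does not exist." >&2
    return 2
  fi
  return 0
}

demo_file=$(mktemp)
status_missing=0; check_benchmark "" || status_missing=$?
status_absent=0; check_benchmark "${demo_file}.nope" || status_absent=$?
status_ok=0; check_benchmark "${demo_file}" || status_ok=$?
echo "${status_missing} ${status_absent} ${status_ok}"
rm -f "${demo_file}"
```

In a real script the call would typically look like check_benchmark "${1}" || exit 1.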
How to perform a particular operation on each line of a file
    FILENAME=${CIRCUIT}-sites_in_path.txt
    NUMBERLINES=`wc -l < ${FILENAME}`
    echo ${NUMBERLINES} > xxx.tmp
    PNUMBERLINES=`perl -n -e '@splitline=split(/ /,$_); $splitline[1]=~s/ //g; print $splitline[0] ."\n"; ' xxx.tmp`
    rm -f xxx.tmp
    echo ${PNUMBERLINES}

    RUN=0
    until [ ${RUN} -eq ${PNUMBERLINES} ]
    do
      RUN=$(( $RUN + 1 ))
      LCONTENTS=`sed -n "${RUN}p" ${FILENAME}`
      echo "line # ${RUN} with contents : ${LCONTENTS}"
    done
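The loop above calls sed once per line, which rereads the file every time. A common alternative, sketched here with made-up file contents, is a while read loop that reads the file once. The variable LAST is only there to show that values set inside the loop survive after it ends.

```shell
#!/bin/bash
demo_dir=$(mktemp -d)
FILENAME="${demo_dir}/sites_in_path.txt"
printf 'N37 N262\nN113\nN696 N233 N263\n' > "${FILENAME}"

RUN=0
# IFS= keeps leading and trailing whitespace, -r keeps backslashes literal
while IFS= read -r LCONTENTS
do
  RUN=$(( RUN + 1 ))
  echo "line # ${RUN} with contents : ${LCONTENTS}"
  LAST=${LCONTENTS}
done < "${FILENAME}"
rm -rf "${demo_dir}"
```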
How to ensure that two files have the same number of lines
    #!/bin/bash

    function compareNlines {
      fileA=${1}
      fileB=${2}

      impcount=`wc -l < ${fileA}`
      echo ${impcount} > xxx.tmp
      A=`perl -n -e '@splitline=split(/ /,$_); $splitline[1]=~s/ //g; print $splitline[0] ."\n"; ' xxx.tmp`
      rm -f xxx.tmp

      impcount=`wc -l < ${fileB}`
      echo ${impcount} > xxx.tmp
      B=`perl -n -e '@splitline=split(/ /,$_); $splitline[1]=~s/ //g; print $splitline[0] ."\n"; ' xxx.tmp`
      rm -f xxx.tmp

      if [ ${A} -ne ${B} ]; then
        echo "ERROR: both files MUST have the same number of lines"
        exit;
      fi
    }

    compareNlines file1.txt file2.txt
Where file1.txt and file2.txt are the filenames you wish to compare.
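A more compact variant of the same check is sketched below. It assumes that returning a status to the caller is acceptable instead of exiting inside the function; same_line_count is a hypothetical name and the three files are made up.

```shell
#!/bin/bash
# same_line_count is a hypothetical name; it returns 0 when both files have the
# same number of lines and 1 otherwise, leaving the decision to exit to the caller.
same_line_count() {
  count_a=$(( $(wc -l < "${1}") ))
  count_b=$(( $(wc -l < "${2}") ))
  [ "${count_a}" -eq "${count_b}" ]
}

demo_dir=$(mktemp -d)
printf '1\n2\n3\n' > "${demo_dir}/file1.txt"
printf 'a\nb\nc\n' > "${demo_dir}/file2.txt"
printf 'only one line\n' > "${demo_dir}/file3.txt"

if same_line_count "${demo_dir}/file1.txt" "${demo_dir}/file2.txt"; then
  equal_result="same"
else
  equal_result="different"
fi
if same_line_count "${demo_dir}/file1.txt" "${demo_dir}/file3.txt"; then
  unequal_result="same"
else
  unequal_result="different"
fi
echo "${equal_result} ${unequal_result}"
rm -rf "${demo_dir}"
```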
How to count the number of occurrences of a character in a file
In this particular example, the character I am counting is the 0.
    res=`tr -dc '0' < file.in | wc -c`
    echo ${res}
How to replace characters in a file
This particular command will replace all “:” with the newline character.
    tr ':' '\n' < longString.txt > readableString.txt
How to delete all instances of a particular character from a file
This particular command will delete all “:” from the file longString.txt and write the result to the file readableString.txt.
    tr -d ':' < longString.txt > readableString.txt
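To see both tr commands side by side, here is a small sketch that applies them to one made-up string instead of a file. The string reuses the name:value layout from the grep example earlier in this post.

```shell
#!/bin/bash
# Made-up sample in the same name:value layout as the grep example
long_string="_N37_:0:_N262_:1"

# Replace every ":" with a newline, giving one element per line
one_per_line=$(printf '%s\n' "${long_string}" | tr ':' '\n')
# Delete every ":" so the elements run together
without_colons=$(printf '%s\n' "${long_string}" | tr -d ':')

echo "${one_per_line}"
echo "${without_colons}"
```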



