Year 22

So I turned 22 yesterday, and while I didn't accomplish much last year, I plan to change that. First, in the mind department, I will read books from Africa. I am going to start by learning the stories behind Nigerian personalities (I lived there for a year), and I think Achebe is a good place to begin (Achebe was an Igbo who was around during Nigerian independence). Next, I want to study at least two wars in the area and in Liberia. Let us see how that goes.

Apart from that, my project for the year has begun taking shape and I am feeling a bit better about my PhD career. Hopefully I will do well from here on and accomplish a lot.

Next, I need to lose weight. I turn to comfort food when things are not going well, and I put on the pounds. That needs to change.

Heritrix Scraping

Recently, I was trying to download a bunch of pages whose URLs matched a common regex pattern. The general idea (I thought) was to set an infinite number of hops and use a regex to restrict the downloaded file set. Of course, I couldn't find a way to set infinite hops (this is a good thing), but I finally figured out another way to get the same result.

Look at the scope section of a crawler.cxml file. It starts with the following comment, so you can find it in your file:


<!-- SCOPE: rules for which discovered URIs to crawl; order is very important because last decision returned other than 'NONE' wins. -->

The rules in the list are:

  • org.archive.modules.deciderules.RejectDecideRule
  • org.archive.modules.deciderules.TooManyHopsDecideRule
  • org.archive.modules.deciderules.TransclusionDecideRule
  • org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
  • org.archive.modules.deciderules.MatchesListRegexDecideRule
  • org.archive.modules.deciderules.MatchesListRegexDecideRule
  • org.archive.modules.deciderules.PathologicalPathDecideRule
  • org.archive.modules.deciderules.TooManyPathSegmentsDecideRule
  • org.archive.modules.deciderules.PrerequisiteAcceptDecideRule
  • org.archive.modules.deciderules.MatchesFilePatternDecideRule
  • org.archive.modules.deciderules.MatchesRegexDecideRule

To make it work, I set the maximum number of hops to 1 and, right before org.archive.modules.deciderules.PrerequisiteAcceptDecideRule, added two regex rules: one to reject everything that didn't match the pattern and one to accept everything that did. The second rule ended up being necessary to work around the hop limit (e.g. if there's pagination to worry about). That was sufficient to visit just the URLs that matched the pattern.
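To see why the order matters, here is a small Python model of the "last decision other than NONE wins" behaviour described in the scope comment. It is only a sketch, not Heritrix configuration: the function names, the example.com URLs, and the pattern are all made up for illustration, and I am assuming the regex has to match the whole URI.

```python
import re

ACCEPT, REJECT, NONE = "ACCEPT", "REJECT", "NONE"

# Hypothetical pattern for the pages we want; not taken from a real crawl.
PATTERN = r"http://example\.com/list\?page=\d+"


def reject_all(uri, hops):
    # Start from REJECT, like the first rule in the list.
    return REJECT


def accept_prefix(prefix):
    def rule(uri, hops):
        return ACCEPT if uri.startswith(prefix) else NONE
    return rule


def too_many_hops(max_hops):
    def rule(uri, hops):
        return REJECT if hops > max_hops else NONE
    return rule


def reject_unless_matches(pattern):
    rx = re.compile(pattern)

    def rule(uri, hops):
        return NONE if rx.fullmatch(uri) else REJECT
    return rule


def accept_if_matches(pattern):
    rx = re.compile(pattern)

    def rule(uri, hops):
        return ACCEPT if rx.fullmatch(uri) else NONE
    return rule


def decide(rules, uri, hops):
    """Run every rule in order; the last non-NONE answer is the decision."""
    decision = NONE
    for rule in rules:
        outcome = rule(uri, hops)
        if outcome != NONE:
            decision = outcome
    return decision


RULES = [
    reject_all,
    accept_prefix("http://example.com/"),
    too_many_hops(1),
    reject_unless_matches(PATTERN),  # added rule 1
    accept_if_matches(PATTERN),      # added rule 2, placed after the hop limit
]
```

With this ordering, decide(RULES, "http://example.com/list?page=7", 5) comes out as ACCEPT even though the URL is five hops away, because the accept rule runs after the hop limit. A non-matching URL on the same site is rejected regardless of its hop count.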

 

Installing Boost.MPI with a Custom Python and a Custom GCC

I was recently chasing a KDD deadline (I still am - so why am I writing this post? I don't know) and was using Boost.MPI's Python bindings for some computation.

On a cluster at CMU, I use EPD-Free [1] as my Python distribution, which is located at:
~/opt/Python27/bin/python

I have a custom gcc/g++ (version 4.7.0) in /opt/gcc/

  • First, grab the Boost source tarball, decompress it and cd into it.
  • Now, we need to install boost:
    • ./bootstrap.sh --prefix=$HOME/opt/boost/ --libdir=$HOME/opt/lib --with-libraries=signals,thread,python,mpi --with-python-root=/home/spalakod/opt/Python27/ --with-python-version=2.7
    • The above command generates a file called project-config.jam
    • This file contains some specifics about your python setup (it allows you to type in the path, version etc. in case it gets it wrong).
    • Despite MPI being specified, it got skipped. More on this later.
    • Now, run ./b2
  • Go to stage/lib under the Boost source directory (for me, that was the directory I got by decompressing the original tarball).
  • If you see an mpi.so there, you're good. I didn't, so I had to do the following:
    • Create a file called user-config.jam
    • I placed one line in there: using mpi ;
    • Now, run ./b2 --user-config=user-config.jam
  • At this stage I had an mpi.so in stage/lib. Add /path/to/stage/lib to LD_LIBRARY_PATH and PYTHONPATH
  • Now, none of the tests will pass because the tests import mpi using
    import boost.mpi

    and the way it is installed, we will need to use:
    import mpi
  • I have attached an archive of the tests that import MPI the correct (or - if you insist - incorrect) way [2]: https://github.com/shriphani/mpi_python_tests
  • Also, put the export statements (for LD_LIBRARY_PATH etc) in ~/.bashrc.
  1. http://www.enthought.com/products/epd_free.php
  2. Includes the MPI Python tests in the libs/mpi/test/python directory except skeletal_content.py
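If you would rather not edit every test, one option is a small import shim that tries boost.mpi first and falls back to the bare mpi module. This is only a sketch of the idea; import_first is a helper name I made up, not something Boost ships.

```python
import importlib


def import_first(names):
    """Return the first module in `names` that can be imported.

    Raises ImportError listing every failure if none of them import.
    """
    errors = []
    for name in names:
        try:
            return importlib.import_module(name)
        except ImportError as exc:
            errors.append("%s: %s" % (name, exc))
    raise ImportError("no module could be imported: " + "; ".join(errors))


# Intended use (needs the Boost.MPI Python bindings on PYTHONPATH):
#   mpi = import_first(["boost.mpi", "mpi"])
#   print("I am process %d of %d" % (mpi.rank, mpi.size))
```

The same script then works on a machine where the bindings were installed as boost.mpi and on one where only stage/lib is on PYTHONPATH.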

Lazy I/O Racket Routines

I needed a lazy version of port->lines and file->lines. Writing one turned out to be a matter of minutes; the code is in a gist.
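For readers who don't use Racket, the idea looks roughly like this in Python. This is not the Racket code from the gist, just an analogous sketch using generators; lazy_port_lines and lazy_file_lines are names I picked to mirror port->lines and file->lines.

```python
def lazy_port_lines(port):
    """Yield lines from an open file-like object one at a time, without newlines."""
    for line in port:
        yield line.rstrip("\n")


def lazy_file_lines(path, encoding="utf-8"):
    """Open `path` and yield its lines lazily; the file closes when iteration ends."""
    with open(path, encoding=encoding) as handle:
        for line in lazy_port_lines(handle):
            yield line
```

Nothing is read until the caller asks for the next line, so a loop that stops early never reads the rest of the file.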

dateutil Problems

Recently I had to deal with dateutil's parser. Apparently it is very powerful and lots of people swear by it, yet I managed to bring it to its knees with this:
>>> import dateutil.parser as dparser
>>> dparser.parse("P 16:08 May 14, 2003 UTC", fuzzy=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/pymodules/python2.6/dateutil/parser.py", line 697, in parse
    return DEFAULTPARSER.parse(timestr, **kwargs)
  File "/usr/lib/pymodules/python2.6/dateutil/parser.py", line 301, in parse
    res = self._parse(timestr, **kwargs)
  File "/usr/lib/pymodules/python2.6/dateutil/parser.py", line 557, in _parse
    res.hour += 12
TypeError: unsupported operand type(s) for +=: 'NoneType' and 'int'

I am not sure what the fuck it is even doing...
Some form of AI possibly.
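Going by the traceback, res.hour was still None when the parser tried to add 12 to it. My guess (not checked against the dateutil source) is that the leading "P" got treated as a PM marker before any hour had been parsed. Until that is sorted out, a wrapper like the one below keeps one bad string from killing a whole run. It is a sketch: safe_parse is my own helper, and the parse function is passed in so the wrapper itself doesn't depend on dateutil.

```python
def safe_parse(parse, text, **kwargs):
    """Call parse(text, **kwargs) and return None if the parser blows up.

    TypeError is caught because of the crash above; ValueError and
    OverflowError cover other inputs a date parser might choke on.
    """
    try:
        return parse(text, **kwargs)
    except (TypeError, ValueError, OverflowError):
        return None


# Intended use (needs dateutil installed):
#   import dateutil.parser as dparser
#   when = safe_parse(dparser.parse, "P 16:08 May 14, 2003 UTC", fuzzy=True)
#   if when is None:
#       pass  # log the string and move on
```

Swallowing TypeError can hide real bugs in your own code, so it is worth logging the strings that come back as None.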