By shriphani on March 14, 2013
Recently, I was trying to download a bunch of things whose URLs share a common regex format. The general idea (I thought) was to set an infinite number of hops and use a regex string to restrict the downloaded file-set. Of course, I couldn't find a way to do infinite hops (this is a good thing), but I finally figured out a way to get the same effect.
Look at the scope section of a crawler.cxml file. It starts like this, so you can find it in your file.
<!-- SCOPE: rules for which discovered URIs to crawl; order is very important because last decision returned other than 'NONE' wins. -->
The list of rules is:
org.archive.modules.deciderules.RejectDecideRule
org.archive.modules.deciderules.TooManyHopsDecideRule
org.archive.modules.deciderules.TransclusionDecideRule
org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
org.archive.modules.deciderules.MatchesListRegexDecideRule
org.archive.modules.deciderules.MatchesListRegexDecideRule
org.archive.modules.deciderules.PathologicalPathDecideRule
org.archive.modules.deciderules.TooManyPathSegmentsDecideRule
org.archive.modules.deciderules.PrerequisiteAcceptDecideRule
org.archive.modules.deciderules.MatchesFilePatternDecideRule
org.archive.modules.deciderules.MatchesRegexDecideRule
To make it work, I just set the hops to 1 and, right before the org.archive.modules.deciderules.PrerequisiteAcceptDecideRule, I added two regex rules: one to reject everything that didn't match and one to accept everything that did. The second one ended up being necessary to work around the hop limit (i.e., if there's pagination to worry about). That was sufficient to visit just the URLs that matched the pattern.
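To see why the order matters, here is a minimal Python model of the "last decision other than NONE wins" behaviour described in the scope comment. It is only a sketch, not Heritrix code: the rule functions, the example.com pattern and the full-match semantics are assumptions made for illustration.

```python
import re

ACCEPT, REJECT, NONE = "ACCEPT", "REJECT", "NONE"


def reject_all(uri, hops):
    # Stand-in for a "reject everything by default" rule at the top of the chain.
    return REJECT


def too_many_hops(max_hops):
    def rule(uri, hops):
        return REJECT if hops > max_hops else NONE
    return rule


def reject_unless_matches(pattern):
    regex = re.compile(pattern)

    def rule(uri, hops):
        # Assumption: the whole URI has to match the pattern.
        return NONE if regex.fullmatch(uri) else REJECT
    return rule


def accept_if_matches(pattern):
    regex = re.compile(pattern)

    def rule(uri, hops):
        return ACCEPT if regex.fullmatch(uri) else NONE
    return rule


def decide(rules, uri, hops):
    """Run every rule in order; the last non-NONE answer is the decision."""
    decision = NONE
    for rule in rules:
        result = rule(uri, hops)
        if result != NONE:
            decision = result
    return decision


# Hypothetical pattern, purely for illustration.
PATTERN = r"https?://example\.com/papers/.*\.pdf"
rules = [
    reject_all,
    too_many_hops(1),
    reject_unless_matches(PATTERN),
    accept_if_matches(PATTERN),
]
```

In this model, a matching URI found three hops away is still accepted because the accept rule runs after the hop limit, while a non-matching URI is rejected even at hop zero.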
Posted in Uncategorized |
By shriphani on February 23, 2013
Recently, I was burned by MPI (installation headaches, and every Python binding crashes conveniently) and we managed to miss a KDD deadline. Later, Alex Smola gave a talk and we had the chance to ask him how to dig ourselves out of our hell-hole; he told us to use Ice. I kicked off by implementing the Hello-World application. I then put together some code to do async updates (which are exactly what we need).
I will describe how to implement a function that maintains no global state and one that does.
No Global State:
The setting:
- The Master exposes a function (fibonacci in this case).
- The Slaves call the Master's fibonacci function in an asynchronous fashion.
First, we simply need to create the .ice file which Ice uses to produce stubs for our code:
Generate the stubs using slice2cpp. Now, we simply implement the fib function. The rest of the stuff is mostly boilerplate (I have hard-coded the identifier and the port number. This is obviously not production-ready code). Note that the FibberI class is from the stub that slice2cpp generates.
Now, the slaves. The few things we need to remember are:
- begin_<function_name> kicks off an asynchronous call. It returns an AsyncResultPtr object which can be used to test for completion or failure and to extract what the Master/Server returned.
- Test for completion (in a loop or something) using r->isCompleted().
- Obtain the return value using end_<function_name>(r).
- In the above statement, r is of type AsyncResultPtr.
The slave looks like this:
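The actual C++ listing is in the repository linked at the end of this post. As a rough analogy only (this is not Ice code), the same begin / poll / end pattern can be sketched in Python with concurrent.futures, where a thread pool stands in for the remote Master and the local fib function is a hypothetical stand-in for the Master's fibonacci function.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fib(n):
    """Iterative Fibonacci; stands in for the function the Master exposes."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def call_async(executor, n):
    future = executor.submit(fib, n)   # plays the role of begin_<function_name>
    while not future.done():           # plays the role of r->isCompleted()
        time.sleep(0.001)              # the caller could do other work here
    return future.result()             # plays the role of end_<function_name>(r)


with ThreadPoolExecutor(max_workers=2) as pool:
    answer = call_async(pool, 10)
```

The point of the pattern is that the caller gets a handle back immediately and decides for itself when to wait for the result.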
With Global State:
In this setting:
- The Master maintains a global array and takes arguments (position and data-item) from the slave and performs the required modifications.
- We init a bunch of slaves (using a for loop on a command line) and they asynchronously issue modification commands.
It is a trivial leap from the previous version to this one. See the github repo directly: https://github.com/shriphani/ice_mod_arr_async_tut
The first example is at: https://github.com/shriphani/ice_async_tut
Posted in Daily life | Tagged c#, Concurrency, Examples, Ice, Master Slave, MPI, Parallelism, Zeroc |
By shriphani on February 18, 2013
I was recently chasing a KDD deadline (I still am - so why am I writing this post? I don't know) and was using Boost.MPI's Python bindings for some computation.
At CMU, there's a cluster on which I use EPD-Free as my Python distribution which is located in:
~/opt/Python27/bin/python
I have a custom gcc/g++ (version 4.7.0) in /opt/gcc/
- First, grab boost from here, decompress and cd into it.
- Now, we need to install boost:
./bootstrap.sh --prefix=~/opt/boost/ --libdir=~/opt/lib --with-libraries=signals,thread,python,mpi --with-python-root=/home/spalakod/opt/Python27/ --with-python-version=2.7
- The above command generates a file called project-config.jam.
- This file contains some specifics about your python setup (it allows you to type in the path, version etc. in case it gets it wrong).
- Despite MPI being specified, it got skipped. More on this later.
- Now, do ./b2
- Go to stage/lib in the appropriate directory (for me it was the directory I got by decompressing the original tarball).
- If you see an mpi.so there, you're good. I didn't, so I had to do the following:
- Create a file called user-config.jam
- I placed one line in there: using mpi ;
- Now, run ./b2 --user-config=user-config.jam
- At this stage I had an mpi.so in stage/lib. Add /path/to/stage/lib to LD_LIBRARY_PATH and PYTHONPATH.
- Now, none of the tests will pass because the tests import mpi using import boost.mpi and, the way it is installed, we will need to use import mpi instead.
- I have attached an archive of the tests that import MPI the correct (or - if you insist - incorrect) way: https://github.com/shriphani/mpi_python_tests
- Also, put the export statements (for LD_LIBRARY_PATH etc.) in ~/.bashrc.
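A quick way to check which of the two import names works on a given machine is to try them in order. This is only a sketch: the helper name first_importable is made up for illustration, and it assumes PYTHONPATH and LD_LIBRARY_PATH have already been set as described above.

```python
import importlib


def first_importable(names):
    """Return (name, module) for the first importable name, else (None, None)."""
    for name in names:
        try:
            return name, importlib.import_module(name)
        except ImportError:
            continue
    return None, None


if __name__ == "__main__":
    # Try the name the bundled tests use first, then the name that worked here.
    name, module = first_importable(["boost.mpi", "mpi"])
    print("MPI bindings importable as:", name)
```

If this prints None, the directory containing mpi.so is most likely not on PYTHONPATH yet.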
Posted in Daily life, python |
By shriphani on January 17, 2013
We are supposed to do these for class. I figured this is a reasonable place to put them after I submit them for grading (i.e., these are mostly for review; use them as reading material at your own risk).
Get it from my dropbox.
Posted in Computer Science | Tagged bayes-nets, graphical models, markov networks, PGM, pgm-notes |
By shriphani on January 15, 2013
I have been trying to use Racket for my research (when I can, that is) and recently I had to use Yahoo!'s BOSS API for some work. I couldn't find an OAuth library for Racket so I had to roll my own. The lib currently supports HMAC-SHA1 signing and can only handle consumer requests (since that is all I had to accomplish). If you need our library, you can get it from the LemurProject github repo.
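For readers unfamiliar with what HMAC-SHA1 signing of a consumer request involves, here is a minimal Python sketch of the OAuth 1.0 signature computation as described in RFC 5849. It is not the Racket library's code; the function names, URL and parameters are hypothetical, and the token secret is left empty because only consumer requests are considered.

```python
import base64
import hashlib
import hmac
from urllib.parse import quote


def pct(value):
    """Percent-encode everything except unreserved characters."""
    return quote(str(value), safe="~")


def signature_base_string(method, url, params):
    # Encode names and values, sort the pairs, then join as name=value&...
    pairs = sorted((pct(k), pct(v)) for k, v in params.items())
    normalized = "&".join(f"{k}={v}" for k, v in pairs)
    return "&".join([method.upper(), pct(url), pct(normalized)])


def sign_hmac_sha1(base_string, consumer_secret, token_secret=""):
    # The key is the two encoded secrets joined by '&'; the token part is
    # empty for consumer-only (two-legged) requests.
    key = f"{pct(consumer_secret)}&{pct(token_secret)}"
    digest = hmac.new(key.encode(), base_string.encode(), hashlib.sha1).digest()
    return base64.b64encode(digest).decode()


# Hypothetical request, for illustration only.
base = signature_base_string(
    "get",
    "http://example.com/search",
    {"q": "a b", "oauth_nonce": "abc"},
)
signature = sign_hmac_sha1(base, "secret")
```

The resulting signature is sent along with the other oauth_ parameters; a real request also needs a timestamp, a nonce and the consumer key among the signed parameters.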
Posted in Computer Science | Tagged Lisp, oauth, oauth-consumer, Racket, Research, Scheme |
By shriphani on January 14, 2013
I needed a lazy version of port->lines and file->lines. Turned out to be a matter of minutes. See this gist:
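For comparison, the same idea (yield lines one at a time instead of building the whole list up front) looks like this as a Python generator. This is an analogy under my own naming, not a translation of the Racket gist.

```python
import io
from itertools import islice


def lazy_lines(port):
    """Yield lines from an open text stream one at a time, without newlines."""
    for line in port:
        yield line.rstrip("\n")


# Only the first two lines are pulled from the stream; the rest stays unread.
port = io.StringIO("a\nb\nc\n")
first_two = list(islice(lazy_lines(port), 2))
```

Because nothing is read until a consumer asks for the next line, this works on inputs too large to hold in memory.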
Posted in Uncategorized |
By shriphani on December 4, 2012
I recently had to implement language-identification for some experiments with Clueweb-12++. I am not a Racket expert and this code is possibly very stupid, but it was mainly a learning exercise.
I need to implement serialization in order to obtain the space gains. Currently I write bools out and I am carrying around extra info.
And this is the language-id module itself.
You can download the entire src here.
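On the serialization point above: the space gain comes from packing each flag into a single bit rather than writing out one boolean per slot. A minimal Python sketch of a bit-packed Bloom filter follows. It is not the Racket module; the class layout, the SHA-1 based hashing and the default sizes are assumptions chosen for illustration.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One bit per slot, packed eight to a byte.
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting the item with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha1(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )

    def to_bytes(self):
        """Serialized form: num_bits / 8 bytes instead of num_bits booleans."""
        return bytes(self.bits)

    @classmethod
    def from_bytes(cls, data, num_bits, num_hashes=3):
        restored = cls(num_bits, num_hashes)
        restored.bits = bytearray(data)
        return restored
```

With the defaults, the serialized filter is 128 bytes for 1024 slots, and a filter rebuilt with from_bytes answers membership queries the same way as the original.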
Posted in Computer Science, Mathematics | Tagged bloom-filter, Functional, language-id, Lisp, Racket, Scheme, what-language |
By shriphani on October 24, 2012
Recently I had to deal with dateutil's parser. Apparently it is very powerful and lots of people masturbate to it, and I managed to bring it to its knees with this:
>>> dparser.parse("P 16:08 May 14, 2003 UTC", fuzzy=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/pymodules/python2.6/dateutil/parser.py", line 697, in parse
    return DEFAULTPARSER.parse(timestr, **kwargs)
  File "/usr/lib/pymodules/python2.6/dateutil/parser.py", line 301, in parse
    res = self._parse(timestr, **kwargs)
  File "/usr/lib/pymodules/python2.6/dateutil/parser.py", line 557, in _parse
    res.hour += 12
TypeError: unsupported operand type(s) for +=: 'NoneType' and 'int'
I am not sure what the fuck it is even doing...
Some form of AI possibly.
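One plausible reading of the traceback (a guess, not something I have verified against that dateutil version) is that the leading "P" was taken as a PM marker before any hour had been seen, so res.hour was still None when 12 was added to it. Whatever the cause, a defensive wrapper keeps one bad string from killing a batch job. The helper name safe_parse is hypothetical, and the parser is passed in as a function so the sketch does not depend on dateutil being installed.

```python
def safe_parse(text, parse_fn, default=None):
    """Call parse_fn(text); return default if the parser raises."""
    try:
        return parse_fn(text)
    except (TypeError, ValueError, OverflowError):
        return default


# With dateutil installed, usage would look roughly like:
#   from dateutil import parser as dparser
#   safe_parse("P 16:08 May 14, 2003 UTC",
#              lambda s: dparser.parse(s, fuzzy=True))
```

Catching TypeError covers the crash shown above, while ValueError and OverflowError cover the parser's ordinary failure modes.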
Posted in python |
By shriphani on October 23, 2012
Posted in Uncategorized |
By shriphani on October 7, 2012
I recently forgot Lagrange Multipliers and this is fucking embarrassing. But Dan Klein came to my rescue with this document:
Lagrange Multipliers Without Permanent Scarring
Posted in Mathematics |