Recently, I started working on a personal project to crawl TvTropes, essentially a casual wiki for discussing pop culture, make a graph of those works which reference each other, and then calculate the “impact factor” of each work in the same way as Google judges the quality of a website by the people that link to it.
I had this as my first attempt at crawling TvTropes.
It was horribly broken of course, Python doesn’t let you recurse
more than 1000 levels by default. This is actually a feature, since
such deep recursion is usually a bug outside of functional programming.
The reason I was running into it here was because I was essentially
trying to recursively walk a tree with a node for every page on TvTropes,
which has about a half million of them by my count. My initial workaround
was to just do
sys.setrecursionlimit(2000000) and get rid of the limit.
This wasn’t the right thing to do and the crawler was missing substantial
numbers of pages.
The solution to my problem was to just use the scrapy library. It was much faster than my code and wasn’t missing pages. As a general rule, if you have a need for some code that does X and you’re not trying to learn more about X you should look for a library instead of rolling your own. If you have to roll your own, then you should make it an open-source project and give back to the community. The way scrapy “walks the tree” if it has a queue of webpages to crawl that it pushes webpages onto and pops them out of. This gets around the issue of limited stack space, by allocating everything on the heap and is the efficient way if your language does not automatically optimize tail recursion. Spoiler alert, Python doesn’t.
In a language that does optimize tail calls, such as Haskell instead of allocating more room on the stack every time we call a function, we can check if the function call is the final expression in a method and choose to reuse the stack space from the function that we’re calling from. This works because we know none of the data stored in that part of the stack is needed anymore.