Web scraping is the extraction of data from web pages. But most web pages aren’t designed to accommodate automated data extraction; instead, they’re designed to be easily read by humans, with colors and fonts and pictures and all sorts of junk. This makes web scraping tricky. There are two predominant techniques for web scraping: HTML parsing and browser automation.

Before going on, I must confess a shameful secret: I don’t understand HTML very well. It’s just too ugly to get me interested. Every so often I’ll try to sit down and read about HTML, and I usually get bored and quit right around the time they get to unordered lists (<ul>). Why couldn’t they just use S-expressions? Do the brackets and explicit close tags actually add anything? Whatever, it doesn’t matter. The bottom line is that I hate dealing with HTML and I’d prefer to avoid it if I possibly can.

So I’m left with browser automation if I want to scrape. But for simple scraping tasks, especially one-off tasks, most browser automation tools seem like overkill. You have to download the thing, then figure out how to use it, wade through documentation, learn the relevant APIs, blah blah blah. Maybe it’s another personal failing, but I hate doing all that stuff, and again I would prefer to avoid it if I possibly can.

So what am I to do when I want to scrape? As usual, the answer is easy: Emacs. And why not? In most cases the data I want to scrape is text, and Emacs is an all-purpose text-handling tool, so really, what else would I use?

As an example, consider Hurriyet Daily News[1], an English-language Turkish news site. Comparing it to American news outlets in terms of journalistic quality, I would say it’s like CNN – not a real journalism organization like the New York Times, but also not a propaganda dissemination machine like Fox News. If you want to keep up with Turkish news and you don’t speak Turkish, it’s not a bad option.

Here’s what hurriyetdailynews.com looks like:

[Screenshot: the hurriyetdailynews.com landing page]

The centerpiece of the landing page is a box containing a half dozen or so headlines with accompanying images. These headlines scroll through one at a time. Suppose, for whatever reason, that I’m interested in tracking the headlines that show up in that box. Here’s how I would do it in Emacs.

First, pop the site open in the Emacs browser eww[2]. It should look like this (and if it doesn’t, then the website has changed and this post is out of date):

[Screenshot: the same page rendered in eww]

There are the headlines, together with topic keywords and some kind of text object corresponding to the story (I think it doesn’t show up because JavaScript isn’t run). Now, open an empty buffer. I usually call my empty buffers asdf, but you can call yours something else if you want. Our ultimate goal is to copy each of those headlines into the empty buffer, at which point we can do whatever with them.

To do this, we’ll use a keyboard macro[3]. Steve Yegge once said “I believe I can state without the slightest hint of exaggeration that Emacs keyboard macros are the coolest thing in the entire universe”, and he’s not wrong. Keyboard macros make boring, repetitive tasks quick and even fun (I’ll sometimes spend more time trying to craft the perfect macro than it would have taken me to do it manually). The way keyboard macros work is you start recording, then hit some keys, then stop recording. When you play back the macro, the keys you recorded will be entered again. The meaning of the keys is not recorded, just the keys themselves, so be careful!
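To make the "keys, not meanings" warning concrete, here is a small sketch that looks up the same key in two different keymaps. It assumes a stock configuration, and `my/tab-meanings` is a name made up for this example.

```elisp
(require 'eww)  ; defines `eww-mode-map'

(defun my/tab-meanings ()
  "Return what TAB is bound to globally and in eww buffers.
Hypothetical helper, only for illustration."
  (list (lookup-key global-map (kbd "TAB"))
        (lookup-key eww-mode-map (kbd "TAB"))))

;; Evaluate this to see the two commands side by side.
(my/tab-meanings)
```

In an unmodified Emacs I'd expect two different commands to come back (indentation globally, link-jumping in eww), which is why a macro recorded in one buffer can do something else entirely when replayed in another.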

Now, pay attention here, because the details are important (except for the details of the headlines, which don’t matter at all). For reference, the first few lines of the headline section look something like this:

Home Page

WORLD


Saudi consul in Istanbul 'relieved of post, to be investigated' as police knock door

TURKEY


Saudi journalist Khashoggi decapitated after fingers cut off: Reports

  1. Move point (cursor) to the beginning of the line in the eww buffer that says Home Page.
  2. Start recording a keyboard macro. The default binding for this is C-x (.
  3. Hit TAB. This will jump down to the first topic keyword, which in this case is WORLD. (This is a link of some kind.)
  4. For whatever reason, the headlines can’t be reached by TAB-jumping, so move the cursor down three lines (C-n, or <down>).
  5. The cursor should now be at the beginning of the line that says “Saudi consul…”. If it isn’t, move it there with C-a. Now highlight the whole line. This can be done by setting the mark and moving to the end of the line (C-SPC C-e), but it can be done other ways too.
  6. Copy the highlighted text, or kill it or whatever the weird Emacs terminology is. I use C-k for this, but that isn’t the default binding (M-w), which I can never remember.
  7. Jump over to the empty buffer. The default binding for this is C-x b, which is an unbelievably shitty way to do something as common as changing buffers. Anyway, hit that and then enter the name of the empty buffer (asdf for me).
  8. The cursor should be at the beginning of the buffer, which should have nothing in it. Paste in (or yank or whatever) the copied text. I use C-v for this, which again is not the standard binding (that would be C-y).
  9. Enter a newline (RET). The cursor should be at the beginning of an empty line at the end of the buffer.
  10. Jump back to the eww buffer. The cursor should be at the end of the “Saudi consul…” line.
  11. Stop recording the macro. The default binding for this is C-x ).

At this point the previously empty buffer should have the first headline, with an inactive cursor at the beginning of an empty line below it, and the active cursor should be at the end of the first headline in the eww buffer. Good? Okay, now execute the macro with C-x e. If it worked, the situation should be the same, but with the second headline copied into the other buffer, and the cursor at the end of the second headline in the eww buffer. Neat, right? If it didn’t work, something got screwed up, and there’s no telling what happened. Undo whatever it did and try again.

There are a few more headlines, so execute the macro as many times as needed to get all of them. For convenience, after hitting C-x e the first time, the macro can be replayed again by just hitting e.
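Replaying can also be done from Lisp. This is a minimal sketch, assuming the macro you just recorded is still the most recent one (Emacs keeps it in `last-kbd-macro`). The helper name `my/replay-last-macro` is made up, and the count of 6 is just an example for the remaining headlines.

```elisp
(defun my/replay-last-macro (times)
  "Replay the most recently recorded keyboard macro TIMES times.
Hypothetical helper.  Interactively, a numeric prefix argument to
C-x e (for example C-u 6 C-x e) should do the same thing."
  (unless last-kbd-macro
    (user-error "No keyboard macro has been recorded yet"))
  (execute-kbd-macro last-kbd-macro times))

;; Grab the six remaining headlines in one go:
;; (my/replay-last-macro 6)
```

If one of the replays hits an error (say, there is no next link), the repetition should stop there rather than plow on.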

The copy buffer should look like this:

Saudi consul in Istanbul 'relieved of post, to be investigated' as police knock door
Saudi journalist Khashoggi decapitated after fingers cut off: Reports
Suspects in Khashoggi case had ties to Saudi crown prince: Report
Turkey to clear Manbij if US fails to do so: Erdoğan tells Pompeo
Thousands of Turkish police, watchmen receive commando training
Istanbul metro receives first reverse vending machine
Dust storm from Syria immerses Turkey in orange cloud

And the headlines are scraped! Obviously this was a somewhat labored explanation, but once you get the hang of keyboard macros, this kind of thing can be done very quickly.

Okay, but there are new headlines every day; what if we want to scrape them regularly? It would be annoying to have to fiddle with keyboard macros every time.

Fortunately, macros can be named and saved. Go to your favorite config file or whatever and execute the following[4]:

(let ((macro-name 'hurriyet-scrape))
  (name-last-kbd-macro macro-name)
  (insert-kbd-macro macro-name))

It should spit out something like this:

(fset 'hurriyet-scrape
   [tab ?\C-n ?\C-n ?\C-n ?\C-  ?\C-e ?\C-k ?\C-t ?a ?s ?d ?f return ?\C-v return ?\C-t ?e ?w ?w return])

Now, if you wanted to leave it at that, you could, and you would, as far as anyone could tell, have a function that did exactly what the macro did. You could call it, bind it to a key, whatever. However, with a macro as complex as this one, it’s usually better just to write a real function. This can be done without too much trouble, as the bulk of the work is just figuring out what commands the key presses are bound to, and then putting those in the function. It doesn’t have to be fancy.
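For the "figuring out what commands the key presses are bound to" part, a small helper can do the lookups. This is only a sketch: `my/macro-commands` is a made-up name, and it resolves keys in whatever buffer is current, so it has to be run where the key would actually be pressed.

```elisp
(defun my/macro-commands (macro)
  "Return a list of (KEY . COMMAND) pairs for the keys in MACRO.
MACRO is a vector of keys, like the one `insert-kbd-macro' prints.
Bindings are looked up in the current buffer.  Hypothetical helper."
  (mapcar (lambda (key)
            (cons (key-description (vector key))
                  (key-binding (vector key))))
          macro))

;; Example, evaluated with the eww buffer current:
;; (my/macro-commands [tab ?\C-n ?\C-e])
```

A nil command just means the key has no direct binding in the current keymaps (some keys get translated to others before lookup), and the keys typed into the minibuffer, like asdf, are plain self-inserting characters, so the output still needs a human to read it.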

Here’s a function for scraping Hurriyet based on that macro. It grabs the headlines and then dumps them into a file called hurriyet-headlines along with a timestamp. Some example output:

2018-09-21 14:26:04 UTC

Turkey, Russia agree on borders of Idlib disarmament zone
German FM praises Turkey’s efforts for Idlib
Turkey expects US to implement Manbij roadmap without delays
Main opposition lawmaker Berberoğlu speaks after release from prison
Letter with forged signature of Erdoğan stirs Swiss controversy
Turkish mother found alive after going missing in wild for three days


2018-09-20 13:08:09 UTC

Turkey’s medium-term economic program revises inflation, growth targets
Turkey will protect its energy rights in Mediterranean: Minister
President Erdoğan meets representatives of US companies in Turkey
Ankara sharply cuts investment levels for Turkish citizenship
Turkish mayor of town bordering Syria attacked
Main opposition leader criticizes party’s performance in June elections
Turkey to work to strengthen ties with Russia: Minister


2018-09-19 17:42:32 UTC

No crisis in Turkey, all manipulations: President Erdoğan
Ankara sharply cuts investment levels for Turkish citizenship
Turkish mayor of town bordering Syria attacked
Main opposition leader criticizes party’s performance in June elections
Turkey to work to strengthen ties with Russia: Minister
24 workers arrested after new Istanbul airport protests
85-year-old man kills wife in Istanbul over ‘social media jealousy’

And the code itself:

(require 'shr)

(defun scrape-hurriyet-headlines ()
  "Scrape the top Hurriyet Daily News headlines.

The Hurriyet home page is expected to be laid out as follows:

<front matter>

Home Page

<topic -- LINK>
<story>

<headline>

<topic -- LINK>
<story>

<headline>

<topic -- LINK>
<story>

<headline>

...

The scraping strategy will be to jump to that home page section, then
walk down the first seven links and copy the headlines associated with
them, pasting them into a result file."
  (interactive)
  (let ((site "http://www.hurriyetdailynews.com/")
        (file (find-file "~/hurriyet-headlines"))
        (headline-count 7))
    ;; Add date and time
    (switch-to-buffer file)
    (goto-char (point-min))
    (insert (format-time-string "%F %T %Z" nil t))
    (newline 2)

    ;; Give eww some time to load
    (eww site)
    (sit-for 2)

    ;; Jump to "Home Page" header
    (re-search-forward "^home page$")

    ;; Stories look like this in eww:
    ;;   <topic -- LINK>
    ;;   <story>
    ;;
    ;;   <headline>

    (dotimes (_ headline-count)
      ;; Navigate to headline
      (shr-next-link)
      (dotimes (_ 3)
        (forward-line))

      ;; Copy headline
      (set-mark-command nil)
      (move-end-of-line nil)
      (kill-ring-save (region-beginning) (region-end))
      (deactivate-mark)

      ;; Paste headline
      (switch-to-buffer file)
      (yank)
      (newline)
      (switch-to-buffer "*eww*"))

    ;; Save and prepare file for next invocation
    (switch-to-buffer file)
    (newline 2)
    (save-buffer)))

To be clear, this is NOT elegant Elisp, and it definitely does stuff that would be inappropriate in a distributed package. It’s also brittle, as scrapers tend to be – if the Hurriyet website changed its format, I would have to dump it in the trash and start over. Nonetheless, it works fine for personal use.
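One thing that would presumably be out of place in a distributed package is the way the function goes through the kill ring and switches the visible buffer just to move text around. As a hedged sketch of the alternative (not what the function above does), the file-writing half could look like this. The name `my/prepend-headlines` and its arguments are made up for illustration.

```elisp
(defun my/prepend-headlines (file headlines)
  "Prepend a UTC timestamp and HEADLINES to FILE.
HEADLINES is a list of strings.  Hypothetical helper, not part of
the scraping function itself."
  (with-current-buffer (find-file-noselect file)
    ;; Newest entries go at the top, as in the example output.
    (goto-char (point-min))
    (insert (format-time-string "%F %T %Z" nil t) "\n\n")
    (dolist (headline headlines)
      (insert headline "\n"))
    ;; Two blank lines separate this entry from the previous one.
    (insert "\n\n")
    (save-buffer)))

;; (my/prepend-headlines "~/hurriyet-headlines"
;;                       '("First headline" "Second headline"))
```

The scraping half would then only need to collect the headline strings into a list, which keeps the kill ring and the window layout out of it.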

Footnotes

[1] hürriyet is a Turkish word derived from the Arabic حرية meaning freedom.

[2] eww is short for Emacs Web Wowser. Really.

[3] Note that keyboard macros are completely unrelated to Lisp macros.

[4] kmacro-name-last-macro can be used in place of name-last-kbd-macro. Its output is a little different:

(fset 'hurriyet-scrape
   (lambda (&optional arg) "Keyboard macro." (interactive "p") (kmacro-exec-ring-item (quote ([tab 14 14 14 67108896 5 11 20 97 115 100 102 return 22 20 101 119 119 return] 0 "%d")) arg)))

This one uses numerical key codes, which I find hard to decipher (you can see 97 115 100 102 for asdf, for instance).
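If the numbers bother you too, `key-description` will translate them back. A quick sketch, with the vector copied verbatim from the output above:

```elisp
;; Turn the numeric key codes back into familiar key names.
(key-description
 [tab 14 14 14 67108896 5 11 20 97 115 100 102 return
      22 20 101 119 119 return])
;; → "<tab> C-n C-n C-n C-SPC C-e C-k C-t a s d f <return> C-v C-t e w w <return>"
```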