June 11, 2016

Scraping YouTube Haikus from Reddit

While browsing Reddit the other day, I stumbled upon /r/youtubehaiku. As with any seemingly popular subreddit that I discover, I decided to check out the top posts. The next day I showed the subreddit to my coworker and jokingly suggested that he write a script to scrape the links from Reddit and create a YouTube playlist from them. We spent the next couple of hours doing just that. Here's how I wrote mine.

The Setup

I used Leiningen's app template as the base of the project by running the following command in the terminal:

lein new app youtubehaikus

I knew I'd need dependencies for making HTTP requests, encoding/decoding JSON, and parsing URLs, so I added the relevant libraries to my project.clj:

:dependencies [[org.clojure/clojure "1.8.0"]
               [clj-http "2.2.0"]
               [cheshire "5.6.1"]
               [com.cemerick/url "0.1.1"]]

Next, I created a new playlist on YouTube and generated an OAuth 2.0 access token. These tokens expire after 60 minutes.

Getting the Reddit Post Data

This is the Reddit API endpoint that we want to use to get the post data. To get the top 100 posts of all time, we need to specify the query params t=all and limit=100 in our request. If you also specify the after=FULLNAME param, you can retrieve the posts that come after a given post. I'll be using this optional parameter to scrape more than just the top 100 posts of all time.

We'll start by defining our endpoint as follows:

(def reddit-url "https://www.reddit.com/r/youtubehaiku/top.json?t=all&limit=100")
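As a rough sketch of how the after param fits into this URL, a small helper can append it only when a fullname is available. The name listing-url and the sample fullname are placeholders of mine, not part of the original script.

```clojure
(def reddit-url "https://www.reddit.com/r/youtubehaiku/top.json?t=all&limit=100")

;; Hypothetical helper: builds the URL for one page of results.
;; `after` is the fullname of the last post on the previous page,
;; or nil when requesting the first page.
(defn listing-url [after]
  (if after
    (str reddit-url "&after=" after)
    reddit-url))

(listing-url nil)         ; first page, no after param
(listing-url "t3_abc123") ; made-up fullname, for illustration only
```

Building every page URL from the same base keeps each request down to a single after param.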

The body of the response looks something like this:

{:kind "Listing",
 :data {:modhash "",
        :children [...]}}

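The post data is stored under the :children key of the parsed response body, so we'll want to retrieve that:

(-> reddit-url
    (http/get {:headers {"User-Agent" "thing by /u/me"
                         "Accept"     "application/json"}})
    :body                      ; the JSON string is under :body in the response map
    (json/parse-string true)   ; true turns the JSON keys into keywords
    (get-in [:data :children]))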

Next, we want to extract the YouTube video ids from the links in the posts. Each object in the :children vector is structured like so:

{:kind "t3",
 :data {...
        :url "..."
        :name "t3_..."}}

and we'll obtain the video ids as follows:

(defn get-path [{:keys [host path query]}]
  (if (= "youtu.be" host)
    (subs path 1)
    (or (get query "v")
        (get query "amp;v"))))

(map #(-> % (get-in [:data :url]) url get-path) post-data)
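To make the branches concrete, here is a small sketch that calls get-path on hand-written maps shaped like the parsed URLs it expects (a :host, a :path, and a :query map with string keys). The video ids are made up. The "amp;v" branch presumably covers links where the & in the query string arrives HTML-escaped as &amp;, so the parsed key would be "amp;v" rather than "v". Links that aren't YouTube videos yield nil, which keep drops.

```clojure
(defn get-path [{:keys [host path query]}]
  (if (= "youtu.be" host)
    (subs path 1)
    (or (get query "v")
        (get query "amp;v"))))

;; Hand-written stand-ins for parsed URLs; the ids are invented.
(def sample-urls
  [{:host "youtu.be"        :path "/abc123" :query nil}
   {:host "www.youtube.com" :path "/watch"  :query {"v" "def456"}}
   {:host "www.youtube.com" :path "/watch"  :query {"feature" "share"
                                                    "amp;v"   "ghi789"}}
   {:host "example.com"     :path "/"       :query nil}])

;; keep behaves like map but discards nil results,
;; so the example.com entry does not appear in the output.
(keep get-path sample-urls)
```

Swapping map for keep in the scraper would be one way to avoid sending requests for posts that don't link to a YouTube video.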

Creating the Playlist

The last step is to iterate through the video ids and add them to our playlist using the YouTube API.

However, I wanted to scrape the top 500 videos, so I used loop/recur together with the after query param of the Reddit endpoint. The FULLNAME to pass as the after param is the value under the :name key in the last entry of the post data.
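The paging idea can be sketched on its own, without any HTTP calls. In the example below, fetch-page is a stand-in for the real request, and fake-pages, collect-posts, and the fullnames are all invented for illustration. It is only meant to show how the :name of the last post becomes the next after value.

```clojure
;; Stand-in data: maps an `after` value to the page of posts that follows it.
(def fake-pages
  {nil    [{:data {:name "t3_a"}} {:data {:name "t3_b"}}]
   "t3_b" [{:data {:name "t3_c"}} {:data {:name "t3_d"}}]
   "t3_d" [{:data {:name "t3_e"}}]})

;; Stand-in for the HTTP request; returns nil when there is no such page.
(defn fetch-page [after]
  (get fake-pages after))

(defn collect-posts
  "Fetches up to `pages` pages, following the `after` cursor each time."
  [pages]
  (loop [after     nil
         remaining pages
         posts     []]
    (let [page      (fetch-page after)
          posts     (into posts page)
          last-name (get-in (last page) [:data :name])]
      ;; Stop when the page budget is used up or there is nothing to follow.
      (if (or (<= remaining 1) (nil? last-name))
        posts
        (recur last-name (dec remaining) posts)))))

(map #(get-in % [:data :name]) (collect-posts 2))
```

Stopping when the last page is empty is an addition of this sketch, so that a listing with fewer pages than requested still terminates.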

Putting it all together, this was my entire namespace:

(ns youtubehaikus.core
  (:require [clj-http.client :as http]
            [cheshire.core :as json]
            [cemerick.url :refer [url]]))

(def reddit-url "https://www.reddit.com/r/youtubehaiku/top.json?t=all&limit=100")

(defn get-path [{:keys [host path query]}]
  (if (= "youtu.be" host)
    (subs path 1)
    (or (get query "v")
        (get query "amp;v"))))

(defn -main [& args]
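  ;; page-url is the URL for the current page. reddit-url in the recur below
  ;; refers to the top-level base URL, so each request carries one after param.
  (loop [page-url reddit-url
         pages    5]
    (let [post-data    (-> page-url
                           (http/get {:headers {"User-Agent" "thing by /u/me"
                                                "Accept"     "application/json"}})
                           :body
                           (json/parse-string true)
                           (get-in [:data :children]))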
          last-post-id (get-in (last post-data) [:data :name])
          haiku-ids    (map #(-> % (get-in [:data :url]) url get-path) post-data)]
      (doseq [id haiku-ids]
        (try (http/post "https://www.googleapis.com/youtube/v3/playlistItems"
                        {:query-params {:access_token "ACCESS_TOKEN"
                                        :part         "snippet"}
                         :content-type :json
                         :body         (-> {:snippet {:playlistId "PLAYLIST_ID"
                                                      :resourceId {:kind    "youtube#video"
                                                                   :videoId id}}}
                                           json/generate-string)})
             (catch Exception _)))
      (when-not (= 1 pages)
        (recur (str reddit-url "&after=" last-post-id)
               (dec pages))))))

You can run this from the root directory of the project on the command line with lein run, or from your REPL.

I wrapped the POST request in a try/catch block that swallows the exception because I was getting some bad responses when attempting to add some of the videos to my playlist. I think it was because some of those videos no longer exist. The link to my playlist is here.
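If you'd rather know which videos were skipped, one alternative to swallowing the exception is to collect the ids that failed. This is only a sketch: add-all and flaky-add are hypothetical names, and add-fn stands for whatever function performs the POST for a single video id.

```clojure
;; Hypothetical wrapper: calls add-fn for each id and returns
;; a vector of the ids whose call threw an exception.
(defn add-all [add-fn ids]
  (reduce (fn [failed id]
            (try
              (add-fn id)
              failed
              (catch Exception e
                (println "Could not add" id "-" (.getMessage e))
                (conj failed id))))
          []
          ids))

;; Stand-in for the real request: fails for one made-up id.
(defn flaky-add [id]
  (when (= id "gone")
    (throw (ex-info "video not found" {:id id}))))

(add-all flaky-add ["abc" "gone" "def"])
```

The returned vector could then be printed at the end of the run or retried later.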

Tags: clojure