XPath expression to get an optional element in Scrapy


By optional, I mean the element may not exist.

I have a spider for GitHub, and I am trying to get the primary language for a repo:

<div class="repository-lang-stats">
    <ol class="repository-lang-stats-numbers">
      <li>
          <a href="/scrapy/scrapy/search?l=python">
            <span class="color-block language-color" style="background-color:#3581ba;"></span>
            <span class="lang">Python</span>
            <span class="percent">99.1%</span>
          </a>
      </li>
      <li>
          <span class="other">
            <span data-lang="Other" class="color-block language-color"></span>
            <span class="lang">Other</span>
            <span class="percent">0.9%</span>
          </span>
      </li>
    </ol>
</div>

In the example above (the source of this repo), I need to get "Python" from the first

<span class="lang">

But my problem is that for some repos, like an empty one, there is no

<span class="lang">

tag, or

<ol class="repository-lang-stats-numbers">

tag. How can I handle this?


I'd go for finding the list of languages, taking the first list item, and retrieving the first span, skipping over any anchor tags (they seem to be missing for some low-frequency languages).

//ol[@class="repository-lang-stats-numbers"]/li[1]//span[@class="lang"]

An empty result will indicate that no language data is available.

Some remarks:

  • To be more specific, you could prepend div[@class="repository-lang-stats"] as a first axis step, but I don't think it will be necessary.
  • We're matching class attributes for exact equality, so watch out: if GitHub ever adds a second class to these elements, the predicate stops matching (contains(@class, "lang") is more tolerant, but can over-match).
  • To return only the text value, append /text() to the query.
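In a Scrapy callback you would typically write response.xpath("...").get(), which returns None instead of raising when nothing matches, so the missing-tag case is handled for free. As a minimal, self-contained sketch of the same behaviour, here is the query run with the stdlib ElementTree (which supports the XPath subset used here) against the snippet from the question:

```python
import xml.etree.ElementTree as ET

# The snippet from the question (well-formed, so ElementTree can parse it).
HTML = """\
<div class="repository-lang-stats">
    <ol class="repository-lang-stats-numbers">
      <li>
          <a href="/scrapy/scrapy/search?l=python">
            <span class="color-block language-color" style="background-color:#3581ba;"></span>
            <span class="lang">Python</span>
            <span class="percent">99.1%</span>
          </a>
      </li>
    </ol>
</div>
"""

root = ET.fromstring(HTML)
# findall() returns an empty list when nothing matches, so a repo with no
# <ol> or <span class="lang"> simply yields primary = None.
matches = root.findall(
    ".//ol[@class='repository-lang-stats-numbers']/li[1]//span[@class='lang']"
)
primary = matches[0].text if matches else None
print(primary)  # Python
```

The same "empty result means no data" pattern applies whether you use Scrapy selectors, lxml, or ElementTree.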

Anyway: GitHub offers an API that also lets you query repository languages. Better to use this instead of scraping the site. APIs are fast, easy to use, and stable; websites are front-end code that changes often and will break your XPath queries.

You can query it by requesting a dedicated URL (for example https://api.github.com/repos/scrapy/scrapy/languages), which returns a JSON object that can easily be parsed and sorted:

{
  "Shell": 1733,
  "Python": 1195439,
  "CSS": 9681
}
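Picking the primary language from that response is then trivial: the values are bytes of code per language, so the key with the largest value wins. A minimal sketch using the sample payload above (in a real spider you would fetch the URL first, yielding a request for it and parsing the body in the callback):

```python
import json

# Sample payload, as returned by
# https://api.github.com/repos/scrapy/scrapy/languages
# (values are bytes of code per language).
payload = '{"Shell": 1733, "Python": 1195439, "CSS": 9681}'

languages = json.loads(payload)
# The primary language is the one with the most bytes of code;
# an empty object (e.g. for an empty repo) yields None.
primary = max(languages, key=languages.get) if languages else None
print(primary)  # Python
```

Note that the empty-repo case is again a non-issue: an empty JSON object parses to an empty dict, and the guard returns None.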