Obtaining multiple blocks of text between tags

advertisements

This is my HTML:

<div class="left_panel">
    <h4>Header1</h4>
      block of text that I want.
    <br />
    <br />
      another block of text that I want.
    <br />
    <br />
      still more text that I want.
    <br />
    <br />
      <p>&nbsp;</p>
    <h4>Header2</h4>

The number of blocks of text is variable, Header1 is consistent, Header2 is not.

I'm successfully extracting the first block of text using the following code:

def get_summary (soup):
raw = soup.find('div',{"class":"left_panel"})
for h4 in raw.findAllNext('h4'):
    following = h4.nextSibling
    return following

However I need all of the items sitting between the two h4 tags, I was hoping that using h4.nextSiblings would solve this, but for some reason that returns the following error:

TypeError: 'NoneType' object is not callable

I've been trying variations on this answer: Find next siblings until a certain one using beautifulsoup but the absence of a leading tag is confusing me.


Find the first header and iterate over .next_siblings until you hit an another header:

from bs4 import BeautifulSoup

data = """
<div class="left_panel">
    <h4>Header1</h4>
      block of text that I want.
    <br />
    <br />
      another block of text that I want.
    <br />
    <br />
      still more text that I want.
    <br />
    <br />
      <p>&nbsp;</p>
    <h4>Header2</h4>
</div>
"""

soup = BeautifulSoup(data)
header1 = soup.find('h4', text='Header1')
for item in header1.next_siblings:
    if getattr(item, 'name') == 'h4' and item.text == 'Header2':
        break

    print item


Update (collecting texts between two h4 tags):

texts = []
for item in header1.next_siblings:
    if getattr(item, 'name') == 'h4' and item.text == 'Header2':
        break

    try:
        texts.append(item.text)
    except AttributeError:
        texts.append(item)

print ''.join(texts)