I'm trying to grab the names of all the high schools from the Wikipedia page "List of high schools in New York City".
I've written enough of the script to get all of the info contained within the <tr> tags of the table that lists each high school, its academic area and its entrance criteria. How can I narrow that down to just the name of the school? I thought it would rest within td, but trying to access that spits back a KeyError.
Code I've written thus far:
from bs4 import BeautifulSoup
from urllib2 import urlopen

NYC = 'https://en.wikipedia.org/wiki/List_of_high_schools_in_New_York_City'

html = urlopen(NYC)
soup = BeautifulSoup(html.read(), 'lxml')
schooltable = soup.find('table')
for td in schooltable:
    print(td)
Output I receive:
<tr>
<td><a href="/wiki/The_Beacon_School" title="The Beacon School">The Beacon School</a></td>
<td>Humanities & interdisciplinary</td>
<td>Academic record, interview</td>
</tr>
Output I'm seeking:
The Beacon School
How about you get the first table on the page, iterate over all rows except the first one (the header), and get the first td element of every row? Works for me:
for row in soup.table.find_all('tr')[1:]:
    print(row.td.text)
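If you want to try the idea without hitting Wikipedia, here is a minimal sketch of the same approach run on an inline snippet. The sample HTML, the "Example High School" row and the `school_names` helper are made up for illustration and only mirror the row structure shown in the question; the stdlib `html.parser` backend is used instead of `lxml` so nothing extra needs installing. Rather than slicing off the first row, it skips any row that has no td, which should also cope with header rows that are not at the top.

```python
from bs4 import BeautifulSoup

# Hypothetical sample data shaped like the table in the question.
SAMPLE_HTML = """
<table>
  <tr><th>School</th><th>Academic area</th><th>Entrance criteria</th></tr>
  <tr>
    <td><a href="/wiki/The_Beacon_School" title="The Beacon School">The Beacon School</a></td>
    <td>Humanities &amp; interdisciplinary</td>
    <td>Academic record, interview</td>
  </tr>
  <tr>
    <td>Example High School</td>
    <td>Science</td>
    <td>Exam</td>
  </tr>
</table>
"""


def school_names(html):
    """Return the text of the first <td> in each row of the first table."""
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find('table')
    names = []
    for row in table.find_all('tr'):
        cell = row.find('td')  # None for header rows that only contain <th>
        if cell is None:
            continue
        names.append(cell.get_text(strip=True))
    return names


print(school_names(SAMPLE_HTML))
```

On the real page you would pass `html.read()` from the question's code to the helper instead of the sample string.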