I have working code that extracts the tide data and makes lists from it (with the html tags still attached for now):
#A Python program to import tide data from the gov.je website
#tidescrape1.0.py - working fine
#It pulls the data in from the gov.je tide site, which is updated daily
#It looks for the td class names associated with date, time and height information
#and then creates a list of the matching bits of html for each
#next step - try to extract just the data for the current day and tweet it.
import urllib2
import re
from bs4 import BeautifulSoup
#open site
rawhtml = urllib2.urlopen("http://www.gov.je/Weather/Pages/Tides.aspx").read(20000)
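#name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(rawhtml, "html.parser")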
#from http://stackoverflow.com/questions/14257717/python-beautifulsoup-wildcard-attribute-id-search
#get the dates:
tidedates = soup.findAll('td', {'class': re.compile('TidesDate.*')} )
print (tidedates[0])
#get the times:
tidetimes = soup.findAll('td', {'class': re.compile('TidesTime.*')} )
print (tidetimes[0])
#get the heights:
tideheights = soup.findAll('td', {'class': re.compile('TidesHeight.*')} )
print (tideheights[0])
Output looks like this:
<td class="TidesDate Weekend">Sunday 9 August</td>
<td class="TidesTime Weekend"><span style="color:#cc0000;">01:57</span><br/>08:42<br/>14:33<br/>21:27<br/></td>
<td class="TidesHeight Weekend"><span style="color:#cc0000;">8.5m</span><br/>3.6m<br/>8.4m<br/>3.7m<br/></td>
The next step is to strip the text out of the html tags.
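One possible way to do that, as a rough sketch rather than the finished script: BeautifulSoup tags have a `stripped_strings` generator that yields each piece of text inside a tag with the surrounding whitespace trimmed, so the `<span>` and `<br/>` markup falls away. The example below runs on the sample output above pasted into a string instead of the live site, and it is written for Python 3. Wrapping the cells in a table row is an assumption about the page layout, and the names `sample_html`, `cell_text` and `tides` are made up for illustration.

```python
import re
from bs4 import BeautifulSoup

# Sample cells copied from the output above. Wrapping them in <table><tr>
# is an assumption; the real page layout may differ.
sample_html = """<table><tr>
<td class="TidesDate Weekend">Sunday 9 August</td>
<td class="TidesTime Weekend"><span style="color:#cc0000;">01:57</span><br/>08:42<br/>14:33<br/>21:27<br/></td>
<td class="TidesHeight Weekend"><span style="color:#cc0000;">8.5m</span><br/>3.6m<br/>8.4m<br/>3.7m<br/></td>
</tr></table>"""

soup = BeautifulSoup(sample_html, "html.parser")


def cell_text(cell):
    """Return the text fragments inside one cell, without tags or whitespace."""
    return list(cell.stripped_strings)


# Same regex class search as in the scraper, taking the first match of each.
date_cell = soup.find_all("td", {"class": re.compile("TidesDate.*")})[0]
time_cell = soup.find_all("td", {"class": re.compile("TidesTime.*")})[0]
height_cell = soup.find_all("td", {"class": re.compile("TidesHeight.*")})[0]

date_text = " ".join(cell_text(date_cell))
times = cell_text(time_cell)
heights = cell_text(height_cell)

# Pair each time with the height listed in the same position.
tides = list(zip(times, heights))

print(date_text)
for tide_time, tide_height in tides:
    print(tide_time, tide_height)
```

Pairing the times with the heights relies on both cells listing their entries in the same order, which holds for the sample above but is worth checking against the live page.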