Scraper

class sunpy.util.scraper.Scraper(pattern, regex=False, **kwargs)[source]

Bases: object

A Scraper to scrape web data archives based on dates.

Parameters
  • pattern (str) – A string containing the URL, with the date encoded as datetime format codes (such as %Y or %m) and any other parameters, passed as kwargs, encoded as string format fields (such as {instrument}).

  • regex (bool) – Set to True if parts of the pattern use regular expression symbols. Be careful: a period matches any character, so it is better to escape periods that are meant literally. If regex is used, other kwargs are ignored and string replacement is not possible. Default is False.
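The warning about periods can be illustrated with the standard re module alone. This is a minimal sketch that does not call Scraper; the file names and the two patterns are hypothetical and chosen only to show the difference between an escaped and an unescaped period.

```python
import re

# Hypothetical file names, for illustration only.
filename_ok = "swap_00174_fd_20150101_025423.fts.gz"
filename_bad = "swap_00174_fd_20150101_025423_fts_gz"

# The same pattern with and without escaped periods.
unescaped = r"swap_\d{5}_fd_\d{8}_\d{6}.fts.gz"
escaped = r"swap_\d{5}_fd_\d{8}_\d{6}\.fts\.gz"

# An unescaped period matches any character, so the unwanted name slips through.
print(bool(re.fullmatch(unescaped, filename_bad)))  # True
# With escaped periods, only a literal "." is accepted.
print(bool(re.fullmatch(escaped, filename_bad)))    # False
print(bool(re.fullmatch(escaped, filename_ok)))     # True
```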

pattern

The pattern string after the kwargs have been substituted into it.

Type

str

now

The pattern with the current date and time substituted into it.

Type

str

Examples

>>> # Downloading data from SolarMonitor.org
>>> from sunpy.util.scraper import Scraper
>>> solmon_pattern = ('http://solarmonitor.org/data/'
...                   '%Y/%m/%d/fits/{instrument}/'
...                   '{instrument}_{wave:05d}_fd_%Y%m%d_%H%M%S.fts.gz')
>>> solmon = Scraper(solmon_pattern, instrument='swap', wave=174)
>>> print(solmon.pattern)
http://solarmonitor.org/data/%Y/%m/%d/fits/swap/swap_00174_fd_%Y%m%d_%H%M%S.fts.gz
>>> print(solmon.now)  
http://solarmonitor.org/data/2017/11/20/fits/swap/swap_00174_fd_20171120_193933.fts.gz

Notes

The now attribute does not point to an existing file; it only shows how the pattern looks when filled in with the current time.
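The two substitution steps behind pattern and now can be reproduced with the standard library alone. The following is a minimal sketch, assuming the kwargs are applied with str.format and the date codes with strftime; whether Scraper is implemented exactly this way is an assumption. A fixed datetime is used instead of the current time so that the result is reproducible.

```python
from datetime import datetime

pattern = ('http://solarmonitor.org/data/'
           '%Y/%m/%d/fits/{instrument}/'
           '{instrument}_{wave:05d}_fd_%Y%m%d_%H%M%S.fts.gz')

# Step 1: fill the keyword fields; the % date codes are left untouched.
converted = pattern.format(instrument='swap', wave=174)
print(converted)

# Step 2: fill the date codes. A fixed time stands in for the current time here.
fixed_time = datetime(2017, 11, 20, 19, 39, 33)
url = fixed_time.strftime(converted)
print(url)
```

The resulting URL is only a formatted string; nothing in these steps checks that a file exists at that address.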

Methods Summary

filelist(timerange)

Returns the list of existing files in the archive for the given time range.

matches(filepath, date)

range(timerange)

Gets the directories for a certain range of time.

Methods Documentation

filelist(timerange)[source]

Returns the list of existing files in the archive for the given time range.

Parameters

timerange (TimeRange) – Time interval in which to search for files matching the pattern.

Returns

filesurls (list of str) – List of all the files found within the given time range.

Examples

>>> from sunpy.util.scraper import Scraper
>>> solmon_pattern = ('http://solarmonitor.org/data/'
...                   '%Y/%m/%d/fits/{instrument}/'
...                   '{instrument}_{wave:05d}_fd_%Y%m%d_%H%M%S.fts.gz')
>>> solmon = Scraper(solmon_pattern, instrument='swap', wave=174)
>>> from sunpy.time import TimeRange
>>> timerange = TimeRange('2015-01-01','2015-01-01T16:00:00')
>>> print(solmon.filelist(timerange))  
['http://solarmonitor.org/data/2015/01/01/fits/swap/swap_00174_fd_20150101_025423.fts.gz',
 'http://solarmonitor.org/data/2015/01/01/fits/swap/swap_00174_fd_20150101_061145.fts.gz',
 'http://solarmonitor.org/data/2015/01/01/fits/swap/swap_00174_fd_20150101_093037.fts.gz',
 'http://solarmonitor.org/data/2015/01/01/fits/swap/swap_00174_fd_20150101_124927.fts.gz']

Note

The search is strict with respect to the time range: if the scraped archive contains daily files but the range does not start at the beginning of a day, the file for that first day will not be selected. The end of the range is normally not a problem, because a file whose time coincides with the end time is included.
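The effect of this strictness can be sketched with plain datetime comparisons. The example below does not call Scraper; the daily file dates and the range limits are hypothetical, and the inclusive comparison at both ends is an assumption based on the note above.

```python
from datetime import datetime

# Hypothetical daily files, each stamped at midnight of its day.
file_dates = [datetime(2015, 1, 1), datetime(2015, 1, 2), datetime(2015, 1, 3)]

# The range starts at 06:00 on the first day, not at midnight.
start = datetime(2015, 1, 1, 6)
end = datetime(2015, 1, 3)

# Keep only the files whose date falls inside the range (both ends included).
selected = [d for d in file_dates if start <= d <= end]
print(selected)
```

The file for 1 January is dropped because its midnight time stamp lies before the start of the range, while the file for 3 January is kept because it coincides with the end. Starting the range at midnight of the first day avoids losing that file.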

matches(filepath, date)[source]

range(timerange)[source]

Gets the directories for a certain range of time.

Parameters

timerange (TimeRange) – Time interval in which to find the directories for the given pattern.

Returns

list of str – List of all the possible directories valid for the given time range. Notice that these directories may not exist in the archive.
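The idea of listing candidate directories can be sketched with the standard library. The helper directories_for_range below is hypothetical and is not part of sunpy; it assumes the directory part of the pattern changes once per day, whereas the actual method may derive the step from the pattern itself.

```python
from datetime import datetime, timedelta


def directories_for_range(directory_pattern, start, end):
    """Return one formatted directory per day from start to end (hypothetical helper)."""
    directories = []
    # Begin at midnight of the first day so that a partial day is still covered.
    day = datetime(start.year, start.month, start.day)
    while day <= end:
        directories.append(day.strftime(directory_pattern))
        day += timedelta(days=1)
    return directories


dirs = directories_for_range('http://solarmonitor.org/data/%Y/%m/%d/fits/swap/',
                             datetime(2015, 1, 1), datetime(2015, 1, 2, 16))
print(dirs)
```

As with the real method, the directories are generated purely from the pattern and the dates, so some of them may not exist in the archive.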