Video API¶

class vidscraper.videos.Video(url, loaders=None, fields=None)¶

This class represents the videos returned by suite scraping, searching, and feed parsing.

Parameters:

url – The “pasted” url for the video.
loaders – An iterable of `VideoLoader` instances that will be used to load video data.
fields – A list of fields which should be fetched for the video. This will be used to optimize the fetching process. Other fields will not be populated, even if the data is available.

link = None¶: The canonical link to the video. This may not be the same as the url used to initialize the video.

guid = None¶: A (supposedly) globally unique identifier for the video.

index = None¶: The position of the video in the feed or search results.

title = None¶: The video’s title.

description = None¶: A text or html description of the video.

publish_datetime = None¶: A python datetime indicating when the video was published.

files = None¶: A list of VideoFile instances representing all the possible files for this video.

flash_enclosure_url = None¶: “Crappy enclosure link that doesn’t actually point to a url.. the kind crappy flash video sites give out when they don’t actually want their enclosures to point to video files.”

embed_code = None¶: The actual embed code which can be used for displaying the video in a browser.

thumbnail_url = None¶: The url for a thumbnail of the video.

user = None¶: The username associated with the video.

user_url = None¶: The url associated with the video’s user.

tags = None¶: A list of tag names associated with the video.

license = None¶: A URL pointing to a description of the license the video is under (often Creative Commons).

is_embeddable = None¶: Whether the video can be embedded, as reported by the service (for example YouTube or Vimeo).

missing_fields¶: Returns a list of fields which have been requested but which have not been filled with data.

load()¶: If the video hasn’t been loaded before, runs the loaders and populates the video’s fields.

get_best_loaders()¶

Returns a list of loaders, drawn from the video’s loaders, which can be used in combination to fill all missing fields - or as many of them as possible.

Earlier-listed loaders and smaller combinations of loaders are preferred, so that as few responses as possible, each as small as possible, need to be fetched.
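
The selection rule described above can be illustrated with a small stand-alone sketch. It is not vidscraper's implementation: the `pick_loaders` function and the `(name, fields)` tuples standing in for loader objects are assumptions made for the example. It tries combinations from smallest to largest, in listing order, and keeps the first one that covers the most missing fields.

```python
from itertools import combinations


def pick_loaders(loaders, missing_fields):
    """Pick the smallest, earliest-listed combination covering the most fields.

    `loaders` is a sequence of (name, fields) pairs; these are hypothetical
    stand-ins for loader objects, not vidscraper classes.
    """
    missing = set(missing_fields)
    if not missing:
        return []
    best, best_count = (), 0
    # Smaller combinations are tried first; combinations() keeps listing order.
    for size in range(1, len(loaders) + 1):
        for combo in combinations(loaders, size):
            covered = set().union(*(fields for _, fields in combo)) & missing
            if covered == missing:
                return list(combo)  # full coverage: stop at the first hit
            if len(covered) > best_count:
                best, best_count = combo, len(covered)
    return list(best)


loaders = [
    ("oembed", {"title", "embed_code"}),
    ("api", {"title", "description", "tags"}),
]
chosen = pick_loaders(loaders, {"title", "tags", "embed_code"})
names = [name for name, _ in chosen]
```

Here no single loader covers all three fields, so both are selected. When only `title` is missing, the first listed loader alone is returned.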

run_loaders()¶: Runs get_best_loaders() and then gets data from each loader.

items()¶: Iterator over (field, value) for requested fields.

serialize()¶: Serializes the video as a python dictionary containing the original url and fields used to initialize the video, as well as the value of each field on the video. Since loaders are intended to be provided by suites and include sensitive information (api keys), they are not serialized.

classmethod deserialize(data, api_keys=None)¶

Given a data dictionary such as would be provided by serialize() and, optionally, api keys, constructs a Video instance for the url and fields in the data, with field values prepopulated from the dictionary.

Parameters:	data – A dictionary as would be provided by `serialize()`. api_keys – `None`, or a dictionary of API keys to instantiate the deserialized video with.
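
The round trip described above might look like the following sketch. `SketchVideo` is a hypothetical stand-in written for this example, and the exact keys of the serialized dictionary are an assumption; only the general shape (url, requested fields, and field values, with no loaders or API keys) comes from the description.

```python
class SketchVideo:
    """Hypothetical stand-in for a video object; not the vidscraper class."""

    field_names = ("title", "description")

    def __init__(self, url, fields=None, api_keys=None):
        self.url = url
        self.fields = list(fields) if fields is not None else list(self.field_names)
        self.api_keys = api_keys if api_keys is not None else {}
        for name in self.field_names:
            setattr(self, name, None)

    def serialize(self):
        # API keys are deliberately left out of the serialized data.
        data = {"url": self.url, "fields": list(self.fields)}
        for name in self.fields:
            data[name] = getattr(self, name)
        return data

    @classmethod
    def deserialize(cls, data, api_keys=None):
        # API keys are supplied again by the caller when restoring.
        video = cls(data["url"], fields=data["fields"], api_keys=api_keys)
        for name in video.fields:
            setattr(video, name, data.get(name))
        return video


video = SketchVideo("http://example.com/v/1", fields=["title"], api_keys={"svc": "secret"})
video.title = "Hello"
data = video.serialize()
restored = SketchVideo.deserialize(data, api_keys={"svc": "secret"})
```

Because the dictionary holds no keys or loaders, it can be stored or passed between processes, and the keys are handed back in only at the point of deserialization.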

get_file(preferred_mimetypes=('video/webm', 'video/ogg', 'video/mp4'))¶: Returns the preferred file from the files for this video. Vidscraper prefers open formats and well-compressed formats. If no file mimetypes are known, the first file will be returned.
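
As a rough illustration of this preference order, the sketch below picks a file from a list of plain dictionaries. The `pick_file` helper and the dictionary layout are assumptions for the example. Falling back to the first file when no preferred type matches is also an assumption; the description only covers the case where no mimetypes are known.

```python
def pick_file(files, preferred_mimetypes=("video/webm", "video/ogg", "video/mp4")):
    """Return the first file matching the earliest preferred MIME type.

    `files` is a list of dicts with an optional 'mime_type' key; these are
    hypothetical stand-ins for VideoFile instances.
    """
    if not files:
        return None
    for mime_type in preferred_mimetypes:
        for candidate in files:
            if candidate.get("mime_type") == mime_type:
                return candidate
    # Nothing matched (or no MIME types are known): fall back to the first file.
    return files[0]


files = [
    {"url": "http://example.com/v.mp4", "mime_type": "video/mp4"},
    {"url": "http://example.com/v.webm", "mime_type": "video/webm"},
]
best = pick_file(files)
```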

class vidscraper.videos.VideoFile(url, expires=None, length=None, width=None, height=None, mime_type=None)¶

Represents a video file hosted somewhere. The only required attribute is the file’s url. There are also several optional metadata attributes, which represent what is claimed about the video by the data provider, not necessarily what is actually true about the video.

url = None¶: The URL of this video file.

expires = None¶: When the URL for this file expires, if at all.

length = None¶: The size of the file, in bytes.

width = None¶: The width of the video, in pixels.

height = None¶: The height of the video, in pixels.

mime_type = None¶: The MIME type of the video.

serialize()¶: Serializes the VideoFile as a python dictionary.

classmethod deserialize(data)¶: Given a data dictionary such as would be provided by serialize(), constructs a VideoFile instance.

class vidscraper.videos.VideoLoader(url, api_keys=None)¶

This is a base class for objects that fetch data for a video, for example from an API or a page scrape.

Parameters:	url – The “pasted” url for which data should be loaded. api_keys – A dictionary of API keys which may be needed to load data with this loader.

fields = set([])¶: A set of fields this loader believes it can provide.

url_format = None¶: A format string which, paired with url_data, returns a suitable url for this loader to fetch data from.

timeout = 3¶: The number of seconds before this loader times out. See python-requests documentation for more information.

headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux) Safari/536.10 vidscraper/(1, 0, 2)'}¶: Extra headers to set on the requests for this loader. See python-requests documentation for more information.

get_url_data(url)¶

Parses the url into data which can be used to construct a url this loader can use to get data.

Raises :	`UnhandledVideo` if the url isn’t handled by this loader.

get_url()¶: Returns a url which can be fetched to get a response that this loader can process into data.

get_headers()¶: Returns a dictionary of headers which will be added to the request. By default, this is a copy of headers.

get_request_kwargs()¶: Returns the kwargs used for making an HTTP request for this loader (with python-requests).

get_video_data(response)¶: Parses the given response and returns a data dictionary for populating a Video instance. By default, returns an empty dictionary.
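
To show how these pieces fit together, here is a self-contained sketch of a loader-like class. It does not import vidscraper: `ExampleLoader`, its `url_re` attribute, the example.com URLs, and the local `UnhandledVideo` class are all assumptions made for illustration, and the response is assumed to be an already-parsed JSON dictionary.

```python
import re


class UnhandledVideo(Exception):
    """Local stand-in for the exception of the same name."""


class ExampleLoader:
    """Hypothetical loader following the interface described above."""

    fields = {"title", "description"}
    # Paired with the url data to build the url that is actually fetched.
    url_format = "https://api.example.com/videos/{video_id}.json"
    url_re = re.compile(r"^https?://(?:www\.)?example\.com/watch/(?P<video_id>\d+)/?$")

    def __init__(self, url, api_keys=None):
        self.url = url
        self.api_keys = api_keys if api_keys is not None else {}
        self.url_data = self.get_url_data(url)

    def get_url_data(self, url):
        match = self.url_re.match(url)
        if match is None:
            raise UnhandledVideo(url)
        return match.groupdict()

    def get_url(self):
        return self.url_format.format(**self.url_data)

    def get_video_data(self, response):
        # `response` is assumed to be a dict parsed from a JSON body.
        return {name: response.get(name) for name in sorted(self.fields)}


loader = ExampleLoader("http://example.com/watch/42")
api_url = loader.get_url()
video_data = loader.get_video_data({"title": "Demo", "description": "A clip", "views": 3})
```

A url that the pattern does not match raises `UnhandledVideo` at construction time, which lets a caller try the next loader.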

class vidscraper.videos.OEmbedLoaderMixin¶

Mixin to provide basic OEmbed functionality. Subclasses need to provide an endpoint, define a get_url_data method, and provide a url_format - for the video URL, not the oembed API URL.

This is provided as a mixin rather than a subclass of VideoLoader so that it can be used on top of any class or mixin that overrides VideoLoader.get_url().

endpoint = None¶: The endpoint for the OEmbed API.

class vidscraper.videos.VideoIterator(start_index=1, max_results=None, video_fields=None, api_keys=None)¶

Generic base class for iterating over groups of videos spread across multiple urls - for example, an rss feed, an api response, or a video list page.

Parameters:

start_index (integer >= 1) – The index of the first video to return. Default: 1.
max_results – The maximum number of videos to return. If this is None (the default), as many videos as possible will be returned.
video_fields – A list of fields to be fetched for each video in the iterator. Limiting this may decrease the number of HTTP requests required for loading video data. See also: Limiting metadata.
api_keys – A dictionary of API keys for various services. Check the documentation for each suite to find what API keys they may want or require.

per_page = None¶: Describes the number of videos expected on each page. This should be set whether or not the number of videos per page can be controlled.

page_url_format = None¶: A format string which will be used to build page urls for this iterator. This should use the {} format described under str.format() in the python docs: http://docs.python.org/library/stdtypes.html#str.format

timeout = 3¶: The number of seconds before a request made by this iterator times out. See python-requests documentation for more information.

headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux) Safari/536.10 vidscraper/(1, 0, 2)'}¶: Extra headers to set on the requests for this iterator. See python-requests documentation for more information.

load()¶: Loads a response if one is not already loaded and tries to extract data from it with data_from_response().

get_response_items(response)¶: Returns an iterable of unparsed items for the response.

get_video_data(item)¶: Parses a single item for the feed and returns a data dictionary for populating a Video instance. By default, returns an empty dictionary. Raises InvalidVideo if the item is found to be invalid in some way; this causes the item to be ignored.
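
The "invalid items are ignored" behaviour can be pictured with the sketch below. The `parse_items` loop, the item dictionaries, and the local `InvalidVideo` class are assumptions for the example; the real iteration logic may differ.

```python
class InvalidVideo(Exception):
    """Local stand-in for the exception of the same name."""


def get_video_data(item):
    # Treat an item without a link as invalid (an arbitrary rule for this sketch).
    if not item.get("link"):
        raise InvalidVideo("item has no link")
    return {"link": item["link"], "title": item.get("title")}


def parse_items(items):
    """Collect data for valid items, silently skipping invalid ones."""
    videos = []
    for item in items:
        try:
            videos.append(get_video_data(item))
        except InvalidVideo:
            continue
    return videos


parsed = parse_items([
    {"link": "http://example.com/v/1", "title": "First"},
    {"title": "No link here"},
    {"link": "http://example.com/v/3"},
])
```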

get_page_url(page_start, page_max)¶: Builds and returns a url for a page of the source by putting the results of get_page_url_data() into page_url_format.

get_page_url_data(page_start, page_max)¶: Returns a dictionary which will be combined with page_url_format to build a page url.
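
A minimal sketch of how the two methods above could combine with `page_url_format`, written as plain functions. The format string, the example.com URL, and the `start-index` and `max-results` query parameter names are invented for the example.

```python
# Hypothetical format string using str.format() style placeholders.
page_url_format = "https://example.com/feed?start-index={page_start}&max-results={page_max}"


def get_page_url_data(page_start, page_max):
    """Return the values that will be substituted into page_url_format."""
    return {"page_start": page_start, "page_max": page_max}


def get_page_url(page_start, page_max):
    """Build a page url from the format string and the page data."""
    return page_url_format.format(**get_page_url_data(page_start, page_max))


first_page = get_page_url(1, 25)
second_page = get_page_url(26, 25)
```

Keeping the data step separate from the formatting step means a subclass can add values (for instance, data parsed from the original url) without rebuilding the url by hand.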

get_headers()¶: Returns a dictionary of headers which will be added to the request. By default, this is a copy of headers.

get_request_kwargs()¶: Returns the kwargs used for making an HTTP request for this iterator.

get_page(page_start, page_max)¶: Given a start and maximum size for a page, fetches and returns a response for that page. The response could be a feedparser dict, a parsed json response, or even just an html page.

data_from_response(response)¶: Given a response as returned from get_page(), returns a dictionary of metadata about this iterator. By default, returns an empty dictionary.

class vidscraper.videos.FeedparserVideoIteratorMixin¶: Overrides get_page(), data_from_response() and get_response_items() to use feedparser. get_video_data() must still be implemented by subclasses.

class vidscraper.videos.BaseFeed(url, last_modified=None, etag=None, **kwargs)¶

Represents a list of videos which can be found at a certain url. The source could easily be an RSS feed, an API response, or a video list page.

In addition to the parameters for VideoIterator, this class takes the following arguments:

Parameters:

url – A url representing a feed page.
last_modified – The last known modification date for the feed. This can be sent to the service provider to try to short-circuit fetching and/or loading a feed whose contents are already known.
etag – An etag which can be sent to the service provider to try to short-circuit fetching a feed whose contents are already known. See also: http://en.wikipedia.org/wiki/HTTP_ETag

Raises :	`UnhandledFeed` if the url can’t be handled by the class being instantiated.

BaseFeed also supports the following “fields”, which are populated with data_from_response(). Fields which have not been populated will be None.

video_count¶: The estimated number of videos for the feed.

last_modified¶: A python datetime representing when the feed was last changed. Before loading the feed, this will be equal to the last_modified date the BaseFeed was instantiated with.

etag¶: A marker representing a feed’s current state. Before loading the feed, this will be equal to the etag the BaseFeed was instantiated with.

description¶: A description of the feed.

webpage¶: The url for an html, human-readable version of the feed.

title¶: The title of the feed.

thumbnail_url¶: A URL for a thumbnail representing the whole feed.

guid¶: A unique identifier for the feed.
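
The last_modified and etag values are the kind of state used by standard HTTP conditional requests. The sketch below builds the corresponding `If-None-Match` and `If-Modified-Since` headers. Whether vidscraper sends exactly these headers is an assumption, the `conditional_headers` helper is invented for the example, and naive datetimes are assumed to be in UTC.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def conditional_headers(etag=None, last_modified=None):
    """Build HTTP conditional-request headers from stored feed state."""
    headers = {}
    if etag is not None:
        headers["If-None-Match"] = etag
    if last_modified is not None:
        if last_modified.tzinfo is None:
            # Assumption: naive datetimes are already in UTC.
            last_modified = last_modified.replace(tzinfo=timezone.utc)
        else:
            last_modified = last_modified.astimezone(timezone.utc)
        headers["If-Modified-Since"] = format_datetime(last_modified, usegmt=True)
    return headers


headers = conditional_headers(
    etag='"abc123"',
    last_modified=datetime(2012, 6, 1, 12, 0, 0),
)
```

A server that honours these headers can answer with 304 Not Modified and an empty body, which is the short-circuit the parameter descriptions refer to.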

get_url_data(url)¶

Parses the url into data which can be used to construct page urls.

Raises :	`UnhandledFeed` if the url isn’t handled by this feed.

class vidscraper.videos.BaseSearch(query, order_by='relevant', **kwargs)¶

Represents a search on a video site. In addition to the parameters for VideoIterator, this class takes the following arguments:

Parameters:

query – The raw string for the search.
order_by – The ordering to apply to the search results. Possible values: `relevant`, `latest`, `popular`. A value may be prefixed with a “-” to indicate descending ordering. Default: `relevant`. If a suite does not support the given ordering, it will return an empty list.

Raises :	`UnhandledSearch` if the class doesn’t support the given parameters.

BaseSearch also supports the following “fields”, which are populated with data_from_response(). Fields which have not been populated will be None.

video_count¶: The estimated number of total videos for this search.

order_by_map = {'relevant': 'relevant'}¶: Dictionary mapping our order_by options (relevant, latest, and popular) to the service’s equivalent term. If an order_by option is not in this dictionary, it is assumed not to be supported by the service.
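
One way to picture how such a map might be consulted is sketched below. The `translate_order_by` helper, the service-side terms `relevance` and `published`, and the local `UnhandledSearch` class are assumptions for the example. Raising for an unsupported ordering is one possible reading of the behaviour described above.

```python
class UnhandledSearch(Exception):
    """Local stand-in for the exception of the same name."""


# Hypothetical mapping for a service that supports two of the three orderings.
order_by_map = {"relevant": "relevance", "latest": "published"}


def translate_order_by(order_by, order_by_map):
    """Return (service_term, descending) for an order_by value such as '-latest'."""
    descending = order_by.startswith("-")
    key = order_by[1:] if descending else order_by
    if key not in order_by_map:
        raise UnhandledSearch("unsupported ordering: %r" % order_by)
    return order_by_map[key], descending


term, descending = translate_order_by("-latest", order_by_map)
```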