All about Sports Aggregator
About Sports Aggregator
Sports Aggregator is an online service that simplifies the process of finding current news and other quality content about your favourite team. The internet is a big place, and we can help make it a bit smaller with the help of our web team, Canadian sports fans, and some computing power. Millions of people trust our sites to cut through all the noise and deliver the highest quality content about their team.
We're 100% Canadian owned and operated (and hosted too)
Sports Aggregator is based in Canada, run by Canadians, and focused on Canadian professional sports teams. Our sites even reside on servers in Canada.
After years in the making, Sports Aggregator is proud to now be covering teams from coast-to-coast. We are thrilled to help Canadian sports fans find content about their favourite sports teams in Toronto (Blue Jays, Raptors, Maple Leafs, TFC), Montreal (Habs), Vancouver (Canucks), Calgary (Flames), Edmonton (Oilers), Winnipeg (Jets), and Ottawa (Senators). We aim to be your one-stop shop for Canadian sports content.
How it started
To make the first version of the (then unnamed) Blue Jays Aggregator, around 2008, we took a bunch of RSS feeds from newspapers, national media and our favourite Jays blogs, and "piped" them into Yahoo Pipes (RIP). We used Yahoo Pipes to filter out all the non-Jays content with a very simple list of keywords and to add authors to the display (not common in feed readers at the time), then added the edited feed to Google Reader.
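For readers curious what that kind of keyword filter looks like, here is a minimal sketch in Python. It is an illustration only, not the original Yahoo Pipes setup: the keyword list, the FeedItem fields and the function names are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical keyword list; the original list is not published.
JAYS_KEYWORDS = ("blue jays", "jays", "rogers centre")


@dataclass
class FeedItem:
    """A simplified feed entry (field names are assumptions)."""
    title: str
    summary: str = ""
    author: str = ""


def is_about_team(item, keywords=JAYS_KEYWORDS):
    """Return True if any keyword appears in the title or summary."""
    text = f"{item.title} {item.summary}".lower()
    return any(keyword in text for keyword in keywords)


def display_title(item):
    """Append the author to the title when one is known."""
    return f"{item.title} ({item.author})" if item.author else item.title


def filter_feed(items):
    """Keep only the items that mention the team."""
    return [item for item in items if is_about_team(item)]
```

A plain substring match like this is crude (it would also match unrelated uses of the word "jays"), which is one reason rules of this kind need ongoing refinement.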
We found it quite useful and continued to add sources and refine the filtration rules. In 2010 we purchased bluejaysaggregator.com and built a website to share with some of our friends. They liked it a lot. Just a few years later, without a penny spent on promoting the site, thousands were coming every day to Blue Jays Aggregator.
Eventually, limiting ourselves to RSS feeds wasn't going to cut it, and we began developing a custom web crawler to discover the content we felt would provide more complete coverage of each team. We didn't want to even be perceived as a burden to the sites we crawl, so we developed a proprietary, unobtrusive method of crawling sites. Using custom configurations, we baked in an understanding of the team and sport as well as the nature of the content from each source we crawl. This allows us to massively throttle the crawler by making hyper-targeted checks, then do all the heavy lifting on our server.
We're more than just a list of links
While the homepage seems like a pretty simple list of links, a lot goes into determining what you see there. The real power of Sports Aggregator lies in our proprietary crawler and vetting system.
Our entire system is custom built from the ground up, fundamentally designed to understand the sports media landscape, to find quality content, promote local journalism and independent media while respecting content creators and the sites we crawl. We are obsessed with quality, attribution, thoroughness, trust, simplicity and unobtrusiveness.
We insist on quality
Our system inherently finds and boosts original reporting and interesting editorial content while de-prioritizing re-published or regurgitated content. For our users this means less time spent clicking through to the same newswire articles at numerous sites (with different headlines), or re-written news coverage from sites that don’t add editorial value. For the sources we monitor, this means more readers!
High quality content is of the utmost importance to us. We build entire custom vetting systems around each team, and employ a team of administrators who constantly refine these systems in order to display the highest quality content. Each day, thousands of items are discovered by our crawlers. Each item added to Sports Aggregator has been put through hundreds of checks for validity, uniqueness and quality. These processes have been refined over the past 10 years. Some are simple and others are very complicated, depending on the source, but together they represent a considerable burden for inclusion.
We love promoting local journalism and up and coming creators
Since 2010, we have been promoting local journalism and independent content creators covering their team in new and exciting ways. We emphasize original content, its authors, and their media outlets, especially local ones. We make the original source of each item as simple and clear to identify as possible.
How do we do it?
When we decide a source of content is good quality and of interest to our users, we start by building an extensive custom configuration file. During this process we map out each source in excruciating detail, develop a crawling plan, and set the burden for content inclusion. We ask ourselves numerous questions about the source and its content to build a knowledge base, and ultimately, to help shape the criteria we require to approve each item. Some of the questions include:
- Is there content about one of the teams we cover?
- What type of media outlet is it? Newspaper? Television network? Radio station?
- What type of content is it? Article? Video? Podcast? Radio clip?
- Is there an RSS feed? An API?
- What is the simplest and least intrusive approach to discovering new content at this source?
- When was the item published? Is there a valid pub date? What is the time zone?
- Who is the author? What is the show and/or network? Is there clear attribution?
- Does the content item have a unique identifier?
- Is it local journalism? Is it a columnist? Does it have editorial value?
- Are the writers/creators employed by a media outlet?
- Is the content from a newswire service?
- Is content published on more than one site but with different titles? Conflicting attribution?
- Is the content behind a paywall?
- What is the length or duration of the content?
- Does the content add value or is it aggregating existing content?
- What is the nature of the content? Breaking news? Statistical analysis? Opinion? Criticism?
- Is the content a slide show or another low-quality format?
So each custom configuration includes a path to discover new content, plus some combination of criteria, informed by the results of questions like the ones above, which must be met before a specific item is added to our site. Some setups are simple, and some are extremely complicated. It all depends on the nature of the content, and the nature of the data discovered.
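To make the idea of a per-source configuration more concrete, here is a minimal sketch in Python. It is not our actual configuration format: every field name, default value and criterion below is an assumption chosen to mirror a few of the questions listed above.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """A discovered content item (fields are illustrative assumptions)."""
    title: str
    author: str = ""
    unique_id: str = ""
    duration_minutes: float = 0.0
    is_newswire: bool = False


@dataclass
class SourceConfig:
    """Hypothetical per-source configuration: a discovery path plus criteria."""
    name: str
    discovery_url: str                  # simplest path to new content, e.g. a feed
    require_author: bool = True
    require_unique_id: bool = True
    allow_newswire: bool = False
    min_duration_minutes: float = 0.0   # only meaningful for audio and video


def failed_criteria(item, config):
    """Return the list of criteria the item fails; an empty list means approved."""
    failures = []
    if config.require_author and not item.author:
        failures.append("missing author")
    if config.require_unique_id and not item.unique_id:
        failures.append("missing unique identifier")
    if item.is_newswire and not config.allow_newswire:
        failures.append("newswire content")
    if item.duration_minutes < config.min_duration_minutes:
        failures.append("too short")
    return failures
```

In this sketch a simple source would use the defaults, while a more complicated one would override several fields; a real setup would involve many more criteria than the four shown here.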
While some of the above questions may seem obvious or unnecessary, each provides one part of a complex equation that seeks to make an assessment of the subject matter and quality. A unique identifier may not register in your day-to-day web browsing, but for us it can potentially indicate unique, duplicate, or altered/updated content. An article with a clearly defined author is much more likely to be useful content. If one episode of a podcast is 3 minutes, and another is 60 minutes, we can usually assume the longer episode will be of more interest to our users.
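As an illustration of how a unique identifier can separate new, duplicate and updated content, here is a minimal sketch in Python. It assumes each item carries an identifier and some body text, and it compares a hash of the normalized text between sightings; the function names and the normalization rule are assumptions for the example, not a description of our production checks.

```python
import hashlib


def fingerprint(text):
    """Hash the text after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def classify(seen, unique_id, body):
    """Classify an item as 'new', 'duplicate' or 'updated'.

    `seen` maps each unique identifier to the fingerprint recorded the
    last time that identifier was encountered; it is updated in place.
    """
    digest = fingerprint(body)
    previous = seen.get(unique_id)
    seen[unique_id] = digest
    if previous is None:
        return "new"
    if previous == digest:
        return "duplicate"
    return "updated"
```

For example, starting from an empty dictionary, the first sighting of an identifier is reported as new, a repeat with the same text as a duplicate, and a repeat with changed text as updated.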
We don't burden the sites we crawl
We always seek out the least obtrusive means for discovering new content; it's part of the configuration and roadmap described above. Most discovery engines work by crawling a page, accumulating all the links on that page, and then crawling those pages, and so on. In our never-ending desire to be unobtrusive, we instead employ a laser-focused plan to retrieve new content as directly and simply as possible. If we crawl more than one page or data source at any site, we leave at least a full 2 minutes (and typically closer to one full hour) before we circle back. Crawling etiquette typically dictates throttling page loads by some amount of time, often 1-2 seconds. Our two-minute minimum is already 60 to 120 times longer than that, and our typical wait is longer still.
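The minimum-wait rule can be sketched in a few lines of Python. This is a simplified illustration, not our crawler: the class name, its methods and the injectable clock are assumptions, and only the two-minute floor comes from the description above.

```python
import time

MIN_INTERVAL_SECONDS = 120  # the two-minute floor described above


class SiteThrottle:
    """Tracks the last request per site and enforces a minimum gap."""

    def __init__(self, min_interval=MIN_INTERVAL_SECONDS, clock=time.monotonic):
        self.min_interval = min_interval
        self.clock = clock          # injectable so the logic can be tested
        self.last_request = {}      # site -> time of the last allowed request

    def seconds_until_allowed(self, site):
        """Return how long to wait before the site may be requested again."""
        last = self.last_request.get(site)
        if last is None:
            return 0.0
        elapsed = self.clock() - last
        return max(0.0, self.min_interval - elapsed)

    def try_acquire(self, site):
        """Record a request and return True only if the gap has elapsed."""
        if self.seconds_until_allowed(site) > 0:
            return False
        self.last_request[site] = self.clock()
        return True
```

A crawler built this way would call try_acquire before each request and skip the site whenever it returns False. Setting min_interval to 3600 would model the more typical one-hour gap.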
We also actively engage with and honour the wishes of those who create the content we index. Over the years, we have interacted with countless content creators, changing descriptions or naming conventions so that they would be happy with how their content is displayed on our site.
We don't accept payment for preferred placement
In short, we don't accept funds in any way that could compromise our objectivity. We don't offer sponsored posts or placements of any kind. We use third-party advertising services to sell our advertising inventory, but in no way do these relationships have any bearing on the order, placement or ranking of any content discovered by our crawlers and indexed at one of the sites in our network.