Welcome to Emmela’s blog.
I have been in the industry for 8 years to date, and I thought I would put together my experiences and thoughts to share with you all.
Aug 2007 was the first time we thought of crawling and fetching data from different websites for analytics purposes. We started by working with RSS feeds, extracting and parsing the data using PHP libraries.
Today, in Dec 2016, I am leading a team of 60 Python engineers scraping data across the globe in 14 different languages, from 40-odd countries, and we have built an in-house framework. We work with the Scrapy framework, with an in-house wrapper on top.
From RSS feeds, we have grown to a level where we are extracting data from:
- Complex websites
- Login-based sources
- HTML, JSON, XML, or any other webpage format
- Offline files: Excel, Docs, PDF
- Email