Emmela’s Web Scraping Services Include:
- Extraction of publicly available and permitted online data
- Unstructured-to-structured data conversion
- Making the extracted data available as an API/JSON, ready for consumption or web data integration.
Want to know more about our
WEB SCRAPING SERVICES?
Emmela is Happy to Assist. Happy Data. Cheers!!!
+91 9632 30 66 00 | raja@emmela.com
Our Web Scraping Process
- Analysis
- Check the legality of web scraping
- Identify ways to crawl if allowed.
- Check for different patterns in the data
- Usually takes a couple of hours to complete the whole analysis
- Scraper
- Write the crawler as per the analysis.
- Time estimates, in days, by crawler type:
- Simple: 2 – 3
- Medium: 2 – 4
- Complex: 3 – 5
- Data
- Dump into a DB from which the application consumes the data, OR
- Upload to the given FTP location, OR
- Create a web-based solution on Flask with JSON as the response (see the sketch after this list).
- Sanity
- Check the health of crawler runs and the extracted data.
- Schedule
- Schedule the crawl to run as per the required frequency
- Run-time Scrapers
- Keep the system ready to extract data by running the crawler at any given time for a set of desired parameters.
- Integration
- Scraper added to the platform for consumption
- Manual Monitoring
- Quality check enabled at all levels.
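For illustration, below is a minimal sketch of the Flask-with-JSON delivery option mentioned in the Data step. The SQLite file and the `listings` table are placeholders; in practice the endpoint reads from whichever database the crawler populates (e.g. MySQL).

```python
# Minimal sketch: expose crawled records as JSON via Flask.
# The SQLite file and the `listings` table are illustrative placeholders.
import sqlite3

from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/api/listings")
def listings():
    # Read whatever the crawler last dumped into the database.
    conn = sqlite3.connect("scraped.db")
    conn.row_factory = sqlite3.Row
    rows = conn.execute("SELECT title, price, url FROM listings").fetchall()
    conn.close()
    return jsonify([dict(row) for row in rows])

if __name__ == "__main__":
    app.run(port=5000)
```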
Classification of Web Sources
SIMPLE
Simple Crawlers
Is a plain HTML web page or an RSS feed
Has all the data to be extracted available on one page.
No navigation needed to crawl the data, i.e., all the requested data is available on the same page.
Little pagination and low data volume
MEDIUM
Medium Crawlers
Has all or a set of features of Simple
Has 1-2 levels of navigation or a simple form-based approach
Email crawling
Source has geographic restrictions, so we need to use proxies from the respective region.
Has forms to fill before extracting data
COMPLEX
Complex Crawlers
Has all or a set of features of Medium
Has nested navigation to extract data
Login-based crawlers
Displays CAPTCHAs that must be solved before data can be extracted
Server frequently throws errors or blocks the IP address while crawling the data.
Has a lot of pages to crawl – huge pagination
PDF/image crawling [based on the resolution, additional manual effort may be needed]
Websites that run on JavaScript, AngularJS, or iframes
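Sites in this last category usually need a real browser to render the page before extraction. Below is a minimal sketch using headless Chrome via Selenium; the URL and CSS selector are placeholders.

```python
# Sketch: render a JavaScript/AngularJS page in headless Chrome, then
# extract from the populated DOM. URL and selector are placeholders.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument("--headless")

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com/angular-app")
    driver.implicitly_wait(10)  # allow the framework time to render
    for card in driver.find_elements(By.CSS_SELECTOR, ".result-card"):
        print(card.text)
finally:
    driver.quit()
```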
Web Scraping Services FAQs
- What about login-based crawling?
- Yes, we can handle it. Some sites are straightforward; for the rest we crawl by passing cookies.
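For illustration, a minimal sketch of the cookie-passing approach with `requests`; the login URL, form fields, and member page are placeholders.

```python
# Sketch: log in once, then reuse the session cookies for member-only pages.
# Login URL, form field names, and credentials are placeholders.
import requests

with requests.Session() as session:
    session.post(
        "https://example.com/login",
        data={"username": "user", "password": "secret"},
    )
    # The session carries the authentication cookies automatically.
    response = session.get("https://example.com/members/data")
    print(response.status_code, len(response.text))
```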
- Regarding privacy
- We care about it. We sign an NDA before we start.
- How do you check the correctness of the crawling?
- We write sanity reports and send alert emails wherever needed. We study the response frequency and patterns, if any, and design the sanity system based on those inputs. We also sync with the customer's team for domain knowledge and take their inputs to make the checks more robust. The initial days involve more emails and calls to get the whole scraping ecosystem in place.
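As a simple illustration, a sanity check might compare the day's extracted record count against an expected minimum and send an alert email when the run looks short. The table name, threshold, SMTP host, and addresses below are placeholders.

```python
# Sketch: flag a crawler run whose record count falls below the expected
# minimum. Table name, threshold, SMTP host, and addresses are placeholders.
import smtplib
import sqlite3
from email.message import EmailMessage

EXPECTED_MINIMUM = 1000  # tuned per source from the observed patterns

def todays_count(db_path="scraped.db"):
    conn = sqlite3.connect(db_path)
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM listings WHERE scraped_on = date('now')"
    ).fetchone()
    conn.close()
    return count

def send_alert(count):
    msg = EmailMessage()
    msg["Subject"] = f"Sanity alert: only {count} records extracted today"
    msg["From"] = "crawler@example.com"
    msg["To"] = "team@example.com"
    msg.set_content("Crawler output is below the expected volume; please check the run.")
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(msg)

if __name__ == "__main__":
    count = todays_count()
    if count < EXPECTED_MINIMUM:
        send_alert(count)
```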
- How do we scale crawling?
- We can add team members to get things done in an emergency. We have a good team to draw from, and an alert given beforehand helps us keep additional resources ready. The ecosystem is customised and built as per each client's requirements, and it makes adding new crawlers possible at any scale.
- How are the timelines?
- Timelines are based on the complexity of the crawler and range from one day to five working days at most. If it is likely to take longer, we keep you informed during the analysis phase.
- How do you determine the complexity of the crawling?
- We categorise websites into three types: Simple, Medium, and Complex (see the Classification of Web Sources above). The time taken to write the crawler is almost the same in all three cases; the additional time and effort goes into the analysis: understanding the website, its patterns, and data visibility.
- Do you crawl all the websites?
- Yes, we attempt to crawl all websites whose data is publicly available or allowed to be crawled.
- This can be determined by checking robots.txt, which tells us whether we are legally entitled to crawl. If a path is marked 'Disallow', we convey the same to the client.
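The robots.txt check itself is straightforward; here is a minimal sketch using Python's standard `urllib.robotparser` (the URLs are placeholders).

```python
# Sketch: check robots.txt before crawling. URLs are placeholders.
from urllib.robotparser import RobotFileParser

parser = RobotFileParser("https://example.com/robots.txt")
parser.read()

if parser.can_fetch("*", "https://example.com/products/"):
    print("Allowed by robots.txt; safe to crawl this path")
else:
    print("Disallowed by robots.txt; we convey this to the client")
```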
- How do you handle multiple requests to a particular website without blocking?
- We use IP rotation: we have multiple IPs, and each request is fired from a random one.
- Does IP rotation incur additional costs? Not always; it depends on the volume of data and the type of website.
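For illustration, a minimal sketch of proxy-based IP rotation with `requests`; the proxy addresses stand in for whatever pool is provisioned for the project.

```python
# Sketch: send each request through a randomly chosen proxy from the pool.
# The proxy addresses and target URL are placeholders.
import random

import requests

PROXY_POOL = [
    "http://10.0.0.1:8080",
    "http://10.0.0.2:8080",
    "http://10.0.0.3:8080",
]

def fetch(url):
    proxy = random.choice(PROXY_POOL)
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=30)

for page in range(1, 4):
    response = fetch(f"https://example.com/listings?page={page}")
    print(page, response.status_code)
```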
- What about infra?
- We can set up the machines internally and crawl the data. The crawled data is then pushed to the client's machines:
- As a daily CSV file over email, OR via FTP, OR pushed into the respective MySQL DB.
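A minimal sketch of the FTP delivery option; the host, credentials, and file names are placeholders.

```python
# Sketch: push the daily CSV extract to the client's FTP location.
# Host, credentials, and file names are placeholders.
from ftplib import FTP

with FTP("ftp.example.com") as ftp:
    ftp.login(user="client", passwd="secret")
    with open("daily_extract.csv", "rb") as fh:
        ftp.storbinary("STOR daily_extract.csv", fh)
```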
- What about CAPTCHAs?
- We solve CAPTCHAs with our own image-reading techniques. At times we take the help of paid services such as DBC or other available APIs; if a paid service is incorporated, the charges are passed on to the client. There is a chance we may not be able to solve a given CAPTCHA, and we convey that beforehand.
- How do you charge?
- We have three types:
- Type 1: Charged on an hourly basis
Advantage: Pay as per the crawlers developed & maintained.
Condition: A minimum number of hours of work needs to be assigned per month.
- Type 2: Dedicated full-time resource
Dedicated to the client and acts as an extended team for the customer.
Economical.
If the dedicated person is on leave, our team takes care of KT or additional support without a break.
Domain knowledge can be built that helps in monitoring the sanity systems.
- Type 3: A combination of Types 1 & 2. Clients can opt for a full-time resource and add hourly resources during emergencies or priority/demo times.
Solutions and Case Studies
- Extraction of Video MetaData
- Requirement
- Movie/TV show/EPG metadata needed to build voice search in a TV remote
- Solution
- Extracted entertainment metadata from popular, content-rich movie and TV show websites
- The extracted metadata is cleaned and gaps are filled to create a consumable database feeding metadata to the voice search.
- Technology
- Python, Scrapy, Selenium
- Database: MySQL
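For illustration, a stripped-down Scrapy spider of the kind used here; the start URL, selectors, and field names are placeholders rather than the actual sources crawled.

```python
# Sketch: a Scrapy spider collecting movie metadata and following pagination.
# The start URL, CSS selectors, and field names are placeholders.
import scrapy

class MovieMetadataSpider(scrapy.Spider):
    name = "movie_metadata"
    start_urls = ["https://example.com/movies"]

    def parse(self, response):
        for movie in response.css("div.movie-card"):
            yield {
                "title": movie.css("h2::text").get(),
                "year": movie.css("span.year::text").get(),
                "genres": movie.css("span.genre::text").getall(),
            }
        # Follow pagination until the source runs out of pages.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```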
- Unified Requirements Portal
- Requirement
- Approximately 60 clients post retail requirements that need to be tracked. The requirements arrive via authenticated, login-based web portals OR via email, with attachments or data in the body.
- These 60 clients post requirements round the clock; manually logging in to each website every now and then to note the requirements is a tough job.
- A unified portal is needed where all the requirements are displayed and can be tracked.
- Solution
- Wrote web scrapers to log in and check for the latest data or new requirements.
- If new requirements are available, the scrapers extract and push the data into a common database with a timestamp and the necessary details.
- Wrote crawlers to read email and download the attachments (a sketch follows this case study).
- The extracted data is pushed to a database.
- Built a simple UI to display all the fetched data in one place, along with a tracking workflow.
- Generate daily reports detailing the number of requirements and their completion status.
- Normalised the parameters used across the 60 clients while dumping into the database, for a uniform display and a clear understanding of the entities; this is needed because each client has its own naming convention.
- Technology
- Web Crawling
- Python, Scrapy, Selenium
- Web Development
- Python, Django
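For illustration, a minimal sketch of the email-reading crawler: fetch unread messages over IMAP and save their attachments. The IMAP host, credentials, and mailbox are placeholders.

```python
# Sketch: download attachments from unread requirement emails over IMAP.
# Host, credentials, and mailbox are placeholders.
import email
import imaplib

with imaplib.IMAP4_SSL("imap.example.com") as imap:
    imap.login("requirements@example.com", "secret")
    imap.select("INBOX")
    _, data = imap.search(None, "UNSEEN")
    for num in data[0].split():
        _, msg_data = imap.fetch(num, "(RFC822)")
        message = email.message_from_bytes(msg_data[0][1])
        for part in message.walk():
            filename = part.get_filename()
            if filename:  # this part is an attachment
                with open(filename, "wb") as fh:
                    fh.write(part.get_payload(decode=True))
```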
- Booking Tickets Automation
- Requirement
- For a flight/bus aggregator, users' ticket bookings and cancellations should happen seamlessly on the go, avoiding manual effort. Not all air/bus service providers offer an API.
- Solution
- Wrote scrapers to book and cancel tickets. Where an API is available we used it; otherwise we logged in programmatically for the respective data (a sketch follows this case study). Scaled the systems so that the traffic hitting the service providers' websites can be handled.
- Designed a UI that gives daily stats. Email alerts are in place so that, if the automated pipeline breaks, there is manual intervention until the system is restored.
- Technology
- Web Crawling
- Python, Scrapy, Selenium
- Automation & Web UI
- Selenium, Python (Flask & Django)
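For illustration, a minimal Selenium sketch of the programmatic-login fallback used when a provider offers no API; the URLs, element IDs, PNR, and credentials are placeholders.

```python
# Sketch: log in to a provider portal and trigger a cancellation when no API
# exists. URLs, element IDs, PNR, and credentials are placeholders.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
try:
    driver.get("https://provider.example.com/agent-login")
    driver.find_element(By.ID, "username").send_keys("agency_user")
    driver.find_element(By.ID, "password").send_keys("secret")
    driver.find_element(By.ID, "login-button").click()

    # Wait for the booking dashboard, then look up and cancel a ticket.
    WebDriverWait(driver, 20).until(
        EC.presence_of_element_located((By.ID, "booking-search"))
    )
    driver.find_element(By.ID, "booking-search").send_keys("PNR12345")
    driver.find_element(By.ID, "cancel-ticket").click()
finally:
    driver.quit()
```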
- Social Media Analytics
- Requirement
- To capture customers' online comments and feedback about a particular brand
- Solution
- Crawl the most popular online news publishing websites in the given country/region
- Identify and crawl all the forums related to the industry, with the client's help.
- Read tweets, FB posts, blogs, and all other social media content
- Crunch the data to generate analytics and share reports (a small sketch follows this case study)
- Technology
- Web Crawling
- Python – Scrapy, Selenium
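As a toy illustration of the crunching step, the sketch below counts brand mentions and a handful of sentiment words across crawled comments; the brand name, word lists, and comments are invented for the example.

```python
# Sketch: tally brand mentions and simple sentiment words in crawled comments.
# The brand name, word lists, and comments are invented for illustration.
from collections import Counter

comments = [
    "Love the new BrandX phone, battery is great",
    "BrandX support was slow and disappointing",
    "Switched to BrandX last month, no complaints",
]

POSITIVE = {"love", "great", "good"}
NEGATIVE = {"slow", "disappointing", "bad"}

counts = Counter()
for comment in comments:
    words = comment.lower().split()
    counts["mentions"] += words.count("brandx")
    counts["positive"] += sum(word in POSITIVE for word in words)
    counts["negative"] += sum(word in NEGATIVE for word in words)

print(dict(counts))
```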
Technology Expertise
- Programming Languages
- Python
- Database
- MySQL, MongoDB, Elasticsearch
- Frameworks
- Scrapy, Selenium, Flask, Django
- Environment
- Ubuntu
Communication
- Available on call for any quick updates
- We share a weekly status email, or at whatever frequency is discussed and agreed between us and the client
- Skype/Zoom calls
- If the client is located in Bangalore, one of our team members is happy to visit for F2F meetings from time to time.
Interested in our
WEB SCRAPING SERVICES?
Emmela is Happy to Assist. Happy Data. Cheers!!!
+91 9632 30 66 00 | raja@emmela.com
You may download our brief writeup by clicking on Web Scraping Services
PS: This page is updated every other month to reflect our latest practices. You may consider it a running document.