
Saturday, September 3, 2022

System Design - Web Crawler By Alex Xu

What is a web crawler?
A web crawler continuously scans web pages and follows the links found on each page it visits.

Why is a crawler used?
Search engine indexing is the most popular use; web archiving, web mining, and web monitoring are others. The complexity of designing a web crawler depends on the scale it needs to support.

Understand the scope of the problem
1. Given a set of URLs, download all the webpages addressed by the URLs.
2. Extract URLs from these web pages.
3. Add the new URLs to the list of URLs to be downloaded. Repeat these 3 steps.
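The three steps above form a simple loop. A minimal sketch in Python, where `fetch` and `extract_links` are hypothetical callables supplied by the caller (a real crawler would fetch over HTTP and parse HTML):

```python
from collections import deque
from urllib.parse import urljoin

def crawl(seed_urls, fetch, extract_links, max_pages=100):
    """BFS crawl: download pages, extract URLs, enqueue unseen ones."""
    frontier = deque(seed_urls)   # URLs waiting to be downloaded
    seen = set(seed_urls)         # URLs already enqueued
    pages = {}                    # url -> downloaded content
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        html = fetch(url)                    # step 1: download the page
        pages[url] = html
        for link in extract_links(html):     # step 2: extract URLs
            link = urljoin(url, link)
            if link not in seen:             # step 3: add new URLs, repeat
                seen.add(link)
                frontier.append(link)
    return pages
```

For example, with a toy link graph standing in for the web, `crawl(["a"], fetch=lambda u: u, extract_links=lambda h: graph[h])` visits every reachable page exactly once.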



Good questions for the interviewer:
Functional
- What is the main purpose, and where is it used? Search engine indexing
- How many web pages does the crawler collect per month? 1 billion
- What type of content? HTML only, including newly added and edited web pages
- How long do we store the content? Up to 5 years
- Duplicate content? Pages with duplicate content should be ignored
- Should we worry, from a security perspective, about the crawler following every path?

Non-functional
- Scalability - crawling should be efficient and parallelized
- Robustness - how well the crawler handles traps, malformed HTML, and malicious web pages
- Politeness - it should not make too many requests to a website within a short period of time
- Extensibility - flexible, so that minimal changes are needed to support new content types
- Availability and consistency matter less here, since no real client is making the requests
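Politeness is usually enforced by spacing out requests to the same host. A minimal sketch of a per-host rate limiter (the class name and delay value are my own; real crawlers also honor robots.txt):

```python
import time
from urllib.parse import urlparse

class PolitenessLimiter:
    """Enforce a minimum delay between requests to the same host."""
    def __init__(self, min_delay_seconds=1.0):
        self.min_delay = min_delay_seconds
        self.last_request = {}   # host -> timestamp of the last request

    def wait(self, url):
        """Block until it is polite to hit this URL's host again."""
        host = urlparse(url).netloc
        elapsed = time.monotonic() - self.last_request.get(host, float("-inf"))
        if elapsed < self.min_delay:
            time.sleep(self.min_delay - elapsed)
        self.last_request[host] = time.monotonic()
```

Calling `limiter.wait(url)` before each download delays only when the same host was hit too recently; different hosts are not throttled against each other.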

Back-of-the-envelope estimation: with 1 billion pages per month, the crawler downloads roughly 400 pages per second on average (peak about 2x that). Assuming an average page size of ~500 KB, that is about 500 TB of new content per month, or roughly 30 PB over 5 years.
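A hedged sketch of the estimation arithmetic, assuming 1 billion pages per month and an average page size of ~500 KB (the page size and the 2x peak factor are assumptions, not given requirements):

```python
# Back-of-the-envelope estimation for the crawler.
pages_per_month = 1_000_000_000
seconds_per_month = 30 * 24 * 3600            # ~2.6 million seconds

qps = pages_per_month / seconds_per_month     # average pages per second
peak_qps = 2 * qps                            # assume peak = 2x average

avg_page_kb = 500                             # assumed average page size
storage_per_month_tb = pages_per_month * avg_page_kb / 1e9   # KB -> TB
storage_5_years_pb = storage_per_month_tb * 12 * 5 / 1000    # TB -> PB

print(round(qps), round(peak_qps))            # ~386 and ~772, i.e. ~400 / ~800
print(storage_per_month_tb, storage_5_years_pb)
```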





High level design 





How do we know a website has been updated? The crawler keeps a hash of each webpage. It traverses the link graph with BFS and compares either the metadata or the content. These crawlers run every day, and if anything has changed, the new version is stored. They also bound how many child links are followed per page. https://web.archive.org/ is a place that archives websites.
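Change detection via hashing can be sketched in a few lines. Here `fingerprints` is a hypothetical url-to-hash store; a real system would back it with a database:

```python
import hashlib

def page_fingerprint(content: bytes) -> str:
    """Hash a page's content so changes can be detected cheaply."""
    return hashlib.sha256(content).hexdigest()

def has_changed(url, content, fingerprints):
    """Compare the new fingerprint with the stored one; update on change."""
    digest = page_fingerprint(content)
    if fingerprints.get(url) != digest:
        fingerprints[url] = digest   # store the new version's hash
        return True
    return False
```

A daily crawl would call `has_changed` for each fetched page and re-store only the pages that return `True`.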


A Bloom filter is a space-efficient data structure that helps determine whether a URL has already been seen.
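A Bloom filter never gives false negatives, but may give occasional false positives, which is an acceptable trade-off for "URL seen?" checks. A minimal sketch (the bit-array size and hash count are illustrative choices):

```python
import hashlib

class BloomFilter:
    """Space-efficient set sketch: no false negatives, rare false positives."""
    def __init__(self, size_bits=1 << 16, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: str):
        # Derive k bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

The crawler adds every enqueued URL and skips any URL for which `might_contain` returns `True`; at most a small fraction of unseen URLs are wrongly skipped, and memory stays constant regardless of how many URLs are tracked.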

Design deep dive. BFS and DFS come to mind first; however, DFS is usually not a good choice because the depth of the link graph can be very deep.