
HTML Screen Scraping using C# .NET WebClient

What is Screen Scraping?

Screen scraping means reading the contents of a web page programmatically. When you visit yahoo.com, what you see is the rendered interface: buttons, links, images and so on. What you don't see is the target URL of each link, the file name of each image, or the method a form uses when a button submits it, which can be POST or GET. In other words, you don't see the HTML behind the page. Screen scraping pulls that HTML, which includes every tag used to build the page.

Why use screen scraping?

Displaying a web page on your own page using Screen Scraping:

Let's look at a small code snippet that displays any page inside your own page. First build a small interface: a button labelled "Display WebPages below" and a label. Believe it or not, the downloaded page will be rendered in place of the label. All of the code goes in the button's Click event handler, shown below.

    private void Button1_Click(object sender, System.EventArgs e)
    {
        WebClient webClient = new WebClient();
        const string strUrl = "http://www.yahoo.com/";

        // download the raw bytes of the page
        byte[] reqHTML = webClient.DownloadData(strUrl);

        // decode the bytes as UTF-8 and show the HTML in the label
        UTF8Encoding objUTF8 = new UTF8Encoding();
        lblWebpage.Text = objUTF8.GetString(reqHTML);
    }

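The download-and-decode step can also be pulled out of the click handler so that the decoding can be checked without a network call. The following is only a minimal sketch: `PageFetcher`, `Decode` and `Fetch` are hypothetical names that do not appear in the original code, and the sketch assumes the page is UTF-8 encoded, as the handler above does.

```csharp
using System;
using System.Net;
using System.Text;

// Hypothetical helper; the class and method names are illustrative only.
public static class PageFetcher
{
    // Turns the raw bytes returned by WebClient.DownloadData into a string.
    // Assumption: the page is UTF-8 encoded.
    public static string Decode(byte[] rawHtml)
    {
        if (rawHtml == null)
        {
            throw new ArgumentNullException("rawHtml");
        }
        return Encoding.UTF8.GetString(rawHtml);
    }

    // Downloads the page at the given address and returns its HTML.
    // The using block disposes the WebClient once the download is done.
    public static string Fetch(string url)
    {
        using (WebClient client = new WebClient())
        {
            return Decode(client.DownloadData(url));
        }
    }
}
```

With such a helper the handler body would shrink to a single assignment of `PageFetcher.Fetch(strUrl)` to `lblWebpage.Text`. Note that recent .NET versions mark `WebClient` as obsolete in favour of `HttpClient`; the sketch keeps `WebClient` because that is the class this article is about.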
Extracting Urls:

The first thing you need in order to extract all the URLs from a web page is a regular expression.

Regular Expression for Extracting Urls:

First import the System.Text.RegularExpressions namespace. Next, build a regular expression that can extract all URLs from the downloaded HTML. The regular expression looks like this:

    Regex r = new Regex("href\\s*=\\s*(?:(?:\\\"(?<url>[^\\\"]*)\\\")|(?<url>[^\\s]*))");
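
This matches every href attribute in the page source: the text href, optional whitespace around the = sign, and then the attribute value, which is captured whether it is wrapped in double quotes or left unquoted.
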
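To see what such a pattern captures, it can be tried on a short HTML string before pointing it at a live page. This is a sketch under stated assumptions: the class `HrefDemo`, the method `ExtractUrls` and the group name `url` are chosen for illustration and are not taken from the original code.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical demo class; names are illustrative only.
public static class HrefDemo
{
    // Assumption: the capture group is named "url" in this sketch.
    // The first alternative captures a double-quoted value,
    // the second captures an unquoted value up to the next whitespace.
    private static readonly Regex HrefPattern = new Regex(
        "href\\s*=\\s*(?:(?:\\\"(?<url>[^\\\"]*)\\\")|(?<url>[^\\s]*))");

    // Returns the captured href value of every match, in document order.
    public static List<string> ExtractUrls(string html)
    {
        List<string> urls = new List<string>();
        foreach (Match match in HrefPattern.Matches(html))
        {
            urls.Add(match.Groups["url"].Value);
        }
        return urls;
    }
}
```

For the input `<a href="http://example.com/a.html">A</a> <a href = "/b.html">B</a> <a href=c.html class=x>C</a>`, `ExtractUrls` returns `http://example.com/a.html`, `/b.html` and `c.html`. The unquoted branch stops only at whitespace, so an unquoted value followed directly by `>` would also capture the `>` and any text up to the next space. The pattern is also case-sensitive, so an attribute written as `HREF` is not matched.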
User Interface in Visual Studio .Net:

We are keeping the user interface pretty simple. It consists of a textbox, a datagrid and a button. The datagrid will be used to display all the extracted URLs.

[Screenshot of the user interface]

The Code:

The code is implemented in the button click event. Before that, let's look at the important declarations. You also need to include the following namespaces:

    using System.Collections;              // ArrayList
    using System.IO;                       // only if you plan to write to a file
    using System.Net;                      // WebClient
    using System.Text;                     // UTF8Encoding
    using System.Text.RegularExpressions;  // Regex

    // the button
    protected System.Web.UI.WebControls.Button Button1;
    // byte array that holds the downloaded page
    private byte[] aRequestHTML;
    // the downloaded HTML as a string
    private string myString = null;
    // the datagrid
    protected System.Web.UI.WebControls.DataGrid DataGrid1;
    // the textbox
    protected System.Web.UI.WebControls.TextBox TextBox1;
    // the label
    protected System.Web.UI.WebControls.Label Label1;
    // array list that collects the extracted URLs
    private ArrayList a = new ArrayList();

Now let's look at the button click code that does the actual work.

    private void Button1_Click(object sender, System.EventArgs e)
    {
        // make an object of the WebClient class
        WebClient objWebClient = new WebClient();
        // get the HTML from the URL written in the textbox
        aRequestHTML = objWebClient.DownloadData(TextBox1.Text);
        // decode the downloaded bytes as UTF-8
        UTF8Encoding utf8 = new UTF8Encoding();
        myString = utf8.GetString(aRequestHTML);

        // regular expression that matches href attributes
        Regex r = new Regex("href\\s*=\\s*(?:(?:\\\"(?<url>[^\\\"]*)\\\")|(?<url>[^\\s]*))");
        // get all the matches
        MatchCollection mcl = r.Matches(myString);
        foreach (Match ml in mcl)
        {
            // add only the captured URL, not the whole href=... match
            a.Add(ml.Groups["url"].Value);
        }

        // assign the array list as the data source and bind it
        DataGrid1.DataSource = a;
        DataGrid1.DataBind();

        // write the extracted URLs to the file named test.txt, one per line
        StreamWriter sw = new StreamWriter(Server.MapPath("test.txt"));
        foreach (string url in a)
        {
            sw.WriteLine(url);
        }
        sw.Close();
    }

The MatchCollection mcl holds all the matches, and you can iterate through it to get every extracted URL. Once you enter a URL in the textbox and press the button, the datagrid is populated with the extracted URLs.
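
Pages often link to the same address more than once, so it can be useful to drop blanks and repeats before the list is written to a file. The following is only a sketch of one way to do that: `UrlReport` and `WriteDistinct` are hypothetical names, not part of the page code above, and the comparison is a plain case-sensitive string match.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper; not part of the original page code.
public static class UrlReport
{
    // Writes each distinct, non-empty URL on its own line, in first-seen order.
    // Accepts any non-generic IEnumerable so that an ArrayList can be passed in.
    // Returns the number of lines written.
    public static int WriteDistinct(IEnumerable urls, TextWriter writer)
    {
        HashSet<string> seen = new HashSet<string>(StringComparer.Ordinal);
        int written = 0;
        foreach (object item in urls)
        {
            string url = item as string;
            string trimmed = url == null ? "" : url.Trim();
            if (trimmed.Length == 0 || !seen.Add(trimmed))
            {
                continue; // skip blanks, non-strings and repeats
            }
            writer.WriteLine(trimmed);
            written++;
        }
        return written;
    }
}
```

In the page, the array list of extracted URLs and a StreamWriter opened on Server.MapPath("test.txt") could be passed to `WriteDistinct`. Because the method takes a TextWriter, it can also be exercised with a StringWriter, with no file involved.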
