HTML Screen Scraping using C# .Net WebClient
What is Screen Scraping?
Screen scraping means reading the contents of a web page. Suppose you go to yahoo.com: what you see is the interface, which includes buttons, links, images and so on. What you don't see is the target URL of the links, the names of the images, or the method used by the button's form, which can be POST or GET. In other words, you don't see the HTML behind the page. Screen scraping pulls the HTML of the web page, including every HTML tag that is used to make up the page.
Why use screen scraping?
The question that comes to mind is why we would ever want the HTML of a web page. Screen scraping does not stop at pulling out the HTML; it can display it as well. In other words, you can pull the HTML from any web page and display that page inside your own page, much as you would with frames. The good thing about screen scraping is that the page is fetched on the server, so it works in every browser, whereas frames unfortunately are not supported everywhere. Also, sometimes you go to a website that has many links labelled image1, image2, image3 and so on, and to see each image you have to click its link so that it opens in the parent or a new window. By using screen scraping you can pull all the images from a particular web page and display them on your own page.
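The image use case can be sketched without any network access, assuming the page's HTML is already held in a string. The ImageScraper class, its ExtractImageSources method and the sample markup are hypothetical names chosen for illustration, and the pattern only handles double-quoted src attributes, so treat it as a starting point rather than a complete solution.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ImageScraper
{
    // Matches <img ... src="..."> and captures the quoted value in the "src" group.
    // Assumption: the src value is wrapped in double quotes.
    private static readonly Regex ImgSrc = new Regex(
        "<img\\b[^>]*?\\bsrc\\s*=\\s*\"(?<src>[^\"]*)\"",
        RegexOptions.IgnoreCase);

    // Returns the src value of every <img> tag found in the given HTML.
    public static List<string> ExtractImageSources(string html)
    {
        List<string> sources = new List<string>();
        foreach (Match m in ImgSrc.Matches(html))
        {
            sources.Add(m.Groups["src"].Value);
        }
        return sources;
    }

    public static void Main()
    {
        // Hypothetical sample markup standing in for a downloaded page
        string html = "<a href=\"big1.jpg\"><img src=\"image1.jpg\"></a>" +
                      "<IMG alt=\"x\" SRC=\"image2.jpg\">";
        foreach (string src in ExtractImageSources(html))
        {
            Console.WriteLine(src);
        }
    }
}
```

Each returned value could then be written into an img tag on your own page. Relative paths would first need to be resolved against the address of the page they came from.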
Displaying a web page on your own page using Screen Scraping:
Let's see a small code snippet that you can use to display any page on your own page. First make a small interface. It is quite simple: a button that says "Display WebPages below" and a label named lblWebpage, and, believe it or not, the web page will be displayed in place of the label. All the code is written in the button's Click event, shown below.
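// Requires the System.Net and System.Text namespaces
private void Button1_Click(object sender, System.EventArgs e)
{
    const string strUrl = "http://www.yahoo.com/";

    // WebClient is disposable, so release it as soon as the download finishes
    using (WebClient webClient = new WebClient())
    {
        // Download the raw bytes of the page
        byte[] reqHTML = webClient.DownloadData(strUrl);

        // Decode the bytes as UTF-8 (assumes the page is UTF-8 encoded)
        // and show the resulting HTML in place of the label
        lblWebpage.Text = Encoding.UTF8.GetString(reqHTML);
    }
}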
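The snippet above decodes every page as UTF-8, which garbles pages served in another encoding. A minimal sketch of one way around this is to read the charset from the Content-Type header, which with WebClient would come from webClient.ResponseHeaders["Content-Type"] after the download. The PageDecoder class and its Decode method are hypothetical names, and the header strings in Main are made-up samples.

```csharp
using System;
using System.Net.Mime;
using System.Text;

public static class PageDecoder
{
    // Picks the encoding named in a Content-Type header value,
    // falling back to UTF-8 when no charset is given or it is not recognised.
    public static string Decode(byte[] body, string contentTypeHeader)
    {
        Encoding encoding = Encoding.UTF8;
        if (!string.IsNullOrEmpty(contentTypeHeader))
        {
            try
            {
                string charset = new ContentType(contentTypeHeader).CharSet;
                if (!string.IsNullOrEmpty(charset))
                {
                    encoding = Encoding.GetEncoding(charset);
                }
            }
            catch (FormatException) { }   // malformed header: keep UTF-8
            catch (ArgumentException) { } // unknown charset name: keep UTF-8
        }
        return encoding.GetString(body);
    }

    public static void Main()
    {
        // "café" encoded as ISO-8859-1 (0xE9 is the accented e)
        byte[] latin1 = { 0x63, 0x61, 0x66, 0xE9 };
        Console.WriteLine(Decode(latin1, "text/html; charset=ISO-8859-1"));

        // No charset in the header, so UTF-8 is assumed
        Console.WriteLine(Decode(Encoding.UTF8.GetBytes("café"), "text/html"));
    }
}
```

Pages that declare their encoding only in a meta tag are not covered by this sketch.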
Extracting URLs:
The first thing you need in order to extract all the URLs from a web page is a regular expression.
Regular Expression for Extracting URLs:
First you need to import the System.Text.RegularExpressions namespace. Next you need a regular expression that can extract all the URLs from the downloaded HTML. Your regular expression would look like this:
Regex r = new Regex("href\\s*=\\s*(?:(?:\\\"(?<1>[^\\\"]*)\\\")|(?<1>[^\\s]*))");
This matches every href attribute in the page source and captures its value in group 1: either the text between double quotes or, for an unquoted value, everything up to the next whitespace character.
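To see what an href pattern of this kind captures, it can be run against a small hand-written HTML string. The sketch below assumes the URL is captured in a group numbered 1. The HrefDemo class, the ExtractHrefs method and the sample markup are hypothetical and not part of the original project.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HrefDemo
{
    // href pattern with the url captured in group 1 (assumed group number).
    // First alternative: a double-quoted value. Second: an unquoted value
    // that runs up to the next whitespace character.
    private static readonly Regex Href = new Regex(
        "href\\s*=\\s*(?:(?:\\\"(?<1>[^\\\"]*)\\\")|(?<1>[^\\s]*))");

    // Returns the captured href value of every match in the given HTML.
    public static List<string> ExtractHrefs(string html)
    {
        List<string> urls = new List<string>();
        foreach (Match m in Href.Matches(html))
        {
            // Groups[0] is the whole "href=..." text; Groups[1] is only the url
            urls.Add(m.Groups[1].Value);
        }
        return urls;
    }

    public static void Main()
    {
        // One quoted and one unquoted link
        string html = "<a href=\"http://example.com/a\">A</a> <a href = /b.html >B</a>";
        foreach (string url in ExtractHrefs(html))
        {
            Console.WriteLine(url);
        }
        // prints:
        // http://example.com/a
        // /b.html
    }
}
```

Note that an unquoted value with no whitespace before the closing > would also swallow the text that follows it, so this pattern is best suited to pages that quote their attributes.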
User Interface in Visual Studio .Net:
We are keeping the user interface pretty simple. It consists of a textbox, a datagrid and a button. The datagrid will be used to display all the extracted URLs.
The Code:
The code is implemented in the button click event, but before that let's see the important declarations. You also need to include the following namespaces:
System.Net;
System.Text;
System.Text.RegularExpressions;
System.Collections; // For the ArrayList
System.IO; // If you plan to write to a file
// creates a button
protected System.Web.UI.WebControls.Button Button1;
// creates a byte array
private byte[] aRequestHTML;
// creates a string
private string myString = null;
// creates a datagrid
protected System.Web.UI.WebControls.DataGrid DataGrid1;
// creates a textbox
protected System.Web.UI.WebControls.TextBox TextBox1;
// creates the label
protected System.Web.UI.WebControls.Label Label1;
// creates the arraylist
private ArrayList a = new ArrayList();
Now let's see the button click code that does the actual work.
private void Button1_Click(object sender, System.EventArgs e)
{
    // download the HTML from the url written in the textbox
    using (WebClient objWebClient = new WebClient())
    {
        aRequestHTML = objWebClient.DownloadData(TextBox1.Text);
    }

    // decode the downloaded bytes as UTF-8 text
    myString = Encoding.UTF8.GetString(aRequestHTML);

    // regular expression that captures the value of each href attribute in group 1
    Regex r = new Regex("href\\s*=\\s*(?:(?:\\\"(?<1>[^\\\"]*)\\\")|(?<1>[^\\s]*))");

    // get all the matches for the regular expression
    MatchCollection mcl = r.Matches(myString);

    foreach (Match ml in mcl)
    {
        // group 1 holds the url itself; group 0 would be the whole "href=..." text
        a.Add(ml.Groups[1].Value);
    }

    // assign the arraylist to the datasource and bind it
    DataGrid1.DataSource = a;
    DataGrid1.DataBind();

    // write the extracted urls to the file named test.txt, one per line
    using (StreamWriter sw = new StreamWriter(Server.MapPath("test.txt")))
    {
        foreach (string url in a)
        {
            sw.WriteLine(url);
        }
    }
}
The MatchCollection mcl holds all the matches, and you can iterate through the collection to get every extracted URL. Once you enter the URL in the textbox and press the button, the datagrid will be populated with the extracted URLs.
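Many of the extracted values will be relative paths or repeats of the same link. As a rough sketch of one possible post-processing step, the values can be resolved against the address of the scraped page and de-duplicated before they are bound to the grid. The UrlNormalizer class, the ToAbsolute method and the sample addresses are hypothetical names used only for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class UrlNormalizer
{
    // Resolves each extracted href against the page url and drops duplicates,
    // keeping the order of first appearance.
    public static List<string> ToAbsolute(string pageUrl, IEnumerable<string> hrefs)
    {
        Uri baseUri = new Uri(pageUrl);
        List<string> result = new List<string>();
        HashSet<string> seen = new HashSet<string>();

        foreach (string href in hrefs)
        {
            Uri absolute;
            // Skip empty values and anything that cannot be parsed as a url
            if (string.IsNullOrEmpty(href) || !Uri.TryCreate(baseUri, href, out absolute))
            {
                continue;
            }
            // HashSet.Add returns false when the url was already seen
            if (seen.Add(absolute.AbsoluteUri))
            {
                result.Add(absolute.AbsoluteUri);
            }
        }
        return result;
    }

    public static void Main()
    {
        // Hypothetical values standing in for the extracted hrefs
        string[] hrefs = { "/b.html", "c.html", "http://other.com/", "/b.html" };
        foreach (string url in ToAbsolute("http://example.com/dir/page.html", hrefs))
        {
            Console.WriteLine(url);
        }
        // prints:
        // http://example.com/b.html
        // http://example.com/dir/c.html
        // http://other.com/
    }
}
```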