HTML Scraping Using Selenium Automatic Web Browsing

Selenium is a suite of tools, with bindings for a range of languages (some officially supported, some not), that aids automated testing of web applications.

The tools allow one to browse a website automatically and to, for instance, take screenshots without the need for user control.  This means such tests can in principle be run overnight and the results checked the following day.

Selenium can also be used to scrape HTML pages automatically, for example to check that advertising content on a big website is correct over a given time period. Saving the HTML and comparing it for changes is one way to ensure the correct content is being displayed over the course of a week, a month, etc.  This is particularly useful for a website with many content contributors (and a poor CMS), where accidental changes to the HTML can (and usually do) occur.
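The save-and-compare idea can be sketched as below. This is only a minimal sketch and not code from the project: the SnapshotComparer class and its HasChanged method are hypothetical names, and it assumes an exact text comparison is good enough (pages containing timestamps or rotating tokens would need normalising first).

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper: these names are illustrative and are not part of Selenium.
public static class SnapshotComparer
{
    // Compares freshly scraped HTML with the snapshot stored at snapshotPath,
    // then overwrites the snapshot with the new HTML.
    // Returns true only when an earlier snapshot exists and its text differs.
    public static bool HasChanged(string snapshotPath, string currentHtml)
    {
        bool changed = false;
        if (File.Exists(snapshotPath))
        {
            string previousHtml = File.ReadAllText(snapshotPath, Encoding.UTF8);
            changed = !string.Equals(previousHtml, currentHtml, StringComparison.Ordinal);
        }
        File.WriteAllText(snapshotPath, currentHtml, Encoding.UTF8);
        return changed;
    }
}
```

The first run for a page only stores the snapshot and reports no change; later runs report whether the page text differs from the previous run.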

There is a range of tools in the Selenium suite (both by Selenium and by third-party developers), allowing testing to be done across most browsers.  The code below covers Internet Explorer and PhantomJS (each explained individually).

// using the Internet Explorer Driver
// requires: using System; using System.IO; using OpenQA.Selenium.IE;
public InternetExplorerDriver createBrowserDriver()
{
    // this sample was for a website with secure pages,
    // so a login username/password was required
    // (loginUrl, loginUserName and loginPassword are assumed to be set elsewhere in the class)
    string LOGIN_URL = loginUrl;
    string username = loginUserName;
    string password = loginPassword;
    // get the path to find the driver executable
    // downloaded as part of the Selenium package
    string path = AppDomain.CurrentDomain.BaseDirectory;
    // open IE to log in to the website
    var options = new InternetExplorerOptions()
    {
        InitialBrowserUrl = LOGIN_URL,
        // Gets or sets a value indicating whether to ignore the settings of
        // the Internet Explorer Protected Mode.
        IntroduceInstabilityByIgnoringProtectedModeSettings = true
    };
    var driver = new InternetExplorerDriver(Path.Combine(path, "drivers"), options);

    return driver;
}

The above returns an instance of the InternetExplorerDriver, which creates an Internet Explorer ‘browser session’ that URLs can be passed to for browsing.  The code below does this.

// the website this code was created for had a login screen that took the destination url as a
// querystring and navigated to it after a successful login, so a simple count can be used to tell
// whether login is necessary.
count++;
// find login page
if (count == 1)
{
    // if this is the first URL in the list, log in and add the url to the querystring
    driver2.Navigate().GoToUrl(LOGIN_URL + URL);
    driver2.FindElement(By.Id("loginID")).SendKeys(username);
    driver2.FindElement(By.Id("password")).SendKeys(password);
    driver2.FindElement(By.TagName("form")).Submit();
}
else
{
    // otherwise reuse the logged-in IE browser to navigate to the subsequent placements
    driver2.Navigate().GoToUrl(URL);
}

// get the page source
htmlString = driver2.PageSource;
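Once the page source has been read it needs storing somewhere so that it can be compared later. The following is a minimal sketch under stated assumptions, not the project's code: the PageSaver class, its file-naming scheme and the fetchPageSource delegate are all hypothetical. The delegate stands in for the navigation shown above, so the saving logic can be tried without a browser.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using System.Text;

// Hypothetical helper: these names are illustrative and are not part of Selenium.
public static class PageSaver
{
    // Builds a file name from the URL and a timestamp,
    // replacing every character that is not a letter or digit with '_'.
    public static string ToFileName(string url, DateTime timestamp)
    {
        var sb = new StringBuilder();
        foreach (char c in url)
        {
            sb.Append(char.IsLetterOrDigit(c) ? c : '_');
        }
        string stamp = timestamp.ToString("yyyyMMdd_HHmmss", CultureInfo.InvariantCulture);
        return sb.ToString() + "_" + stamp + ".html";
    }

    // Fetches each URL through the supplied delegate and writes the HTML
    // to one file per URL. Returns the paths of the files written.
    public static List<string> SaveAll(IEnumerable<string> urls,
                                       Func<string, string> fetchPageSource,
                                       string outputDirectory,
                                       DateTime timestamp)
    {
        Directory.CreateDirectory(outputDirectory);
        var saved = new List<string>();
        foreach (string url in urls)
        {
            string html = fetchPageSource(url);
            string file = Path.Combine(outputDirectory, ToFileName(url, timestamp));
            File.WriteAllText(file, html, Encoding.UTF8);
            saved.Add(file);
        }
        return saved;
    }
}
```

With Selenium, the delegate could be a lambda that calls Navigate().GoToUrl(url) on the driver and returns its PageSource, as in the fragment above. Very long URLs may produce file names that exceed path-length limits, which this sketch does not handle.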

Using the InternetExplorerDriver means IE itself does the automated browsing.  This works well and is fully supported by Selenium, but watching the browser work is not always necessary, and it is likely to be a distraction if the machine is being used for something else.

PhantomJS is a headless WebKit browser, basically a browser without a front end, and PhantomJSDriver is the Selenium driver for it.  This means one can do automated browser testing without a browser window popping up; perfect if the app is intended to be used by someone who doesn’t need to see a browser in action and, more to the point, needs to use their machine for other work.

The code for PhantomJSDriver is below.

// requires: using System; using System.IO; using OpenQA.Selenium.PhantomJS;
public PhantomJSDriver createBrowserDriver()
{
    // get the path to find the driver executable
    string path = AppDomain.CurrentDomain.BaseDirectory;

    var driver = new PhantomJSDriver(Path.Combine(path, "drivers"));

    return driver;
}

There are no options that need passing, so this method is a lot cleaner.  The driver can be used in the same way as the InternetExplorerDriver to get the page source.  The only difference is that no browser window is opened when scraping begins.

It is a good idea to get the code working with the InternetExplorerDriver first, to confirm it behaves properly, and then switch to the PhantomJSDriver and retest.
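One way to make that switch painless is to write the scraping code against the IWebDriver interface, which both driver classes implement, and pick the concrete driver in one place. The sketch below is an assumption about how this could be arranged, not the project's code: the BrowserKind enum and DriverFactory class are hypothetical names, and it assumes a Selenium release that still ships the OpenQA.Selenium.PhantomJS namespace, as the code above does.

```csharp
using System;
using System.IO;
using OpenQA.Selenium;
using OpenQA.Selenium.IE;
using OpenQA.Selenium.PhantomJS;

// Hypothetical names: BrowserKind and DriverFactory are illustrative only.
public enum BrowserKind { InternetExplorer, PhantomJS }

public static class DriverFactory
{
    // Creates the requested driver, looking for the driver executables
    // in a "drivers" folder under the application's base directory.
    public static IWebDriver Create(BrowserKind kind)
    {
        string driverDirectory = Path.Combine(AppDomain.CurrentDomain.BaseDirectory, "drivers");
        switch (kind)
        {
            case BrowserKind.InternetExplorer:
                return new InternetExplorerDriver(driverDirectory);
            case BrowserKind.PhantomJS:
                return new PhantomJSDriver(driverDirectory);
            default:
                throw new ArgumentOutOfRangeException("kind");
        }
    }
}
```

The rest of the code then only uses IWebDriver members such as Navigate(), FindElement() and PageSource, so moving from IE to PhantomJS is a one-argument change.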

The driver executables must be accessible to the project (i.e. included in the project and referenceable by path) in order for the drivers to run.  The executables are inside the Selenium package.
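A missing executable only shows up as an exception when the driver is constructed, so a quick check at start-up can give a clearer message. This is a minimal sketch, assuming the executables sit in a single folder; the DriverCheck class is a hypothetical name, and the file names to pass in (typically IEDriverServer.exe and phantomjs.exe) should be confirmed against the downloaded package.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper: this class is illustrative and is not part of Selenium.
public static class DriverCheck
{
    // Returns the names of the executables that are NOT present in driverDirectory.
    public static List<string> FindMissing(string driverDirectory, params string[] executableNames)
    {
        var missing = new List<string>();
        foreach (string name in executableNames)
        {
            if (!File.Exists(Path.Combine(driverDirectory, name)))
            {
                missing.Add(name);
            }
        }
        return missing;
    }
}
```

An empty list means every named executable was found; anything else can be reported to the user before a driver is created.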

A VS2013 project is available for download from here.  It uses ASP.NET 4.0.

Also on my GitHub.