HTML Scraping Using Selenium Automated Web Browsing


Selenium is a set of tools, with bindings for a range of languages (both officially supported and third party), for automated testing of web applications.

The tools allow one to browse a website automatically and, for instance, take screenshots without any user interaction.  This means such tests can in principle be run overnight and the results checked the following day.
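
For example, capturing a screenshot is a one-liner once a driver instance exists (a quick sketch; the file name is arbitrary, and the exact SaveAsFile signature varies between Selenium .NET versions):

 // capture a screenshot with an existing driver instance
 // (newer Selenium .NET versions take ScreenshotImageFormat.Png
 // instead of System.Drawing.Imaging.ImageFormat.Png)
 Screenshot screenshot = ((ITakesScreenshot)driver).GetScreenshot();
 screenshot.SaveAsFile("homepage.png", System.Drawing.Imaging.ImageFormat.Png);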

Selenium can also be used to automatically scrape HTML pages, for reasons such as checking that advertising content is correct on a large website over a given time period. Saving the HTML and comparing it for changes is one way to ensure the correct content is displayed over the course of a week or month.  This is particularly useful for a website with many content contributors (and a poor CMS), where accidental changes to the HTML can, and usually do, occur.
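
As a rough illustration of that comparison step (the method name and file path here are hypothetical, not from the original project), the page source can be hashed and checked against the previous run:

 // a minimal sketch of change detection: hash today's page source
 // and compare it with the hash saved on the previous run
 public bool HasPageChanged(string htmlString, string hashFilePath)
 {
     using (var sha = System.Security.Cryptography.SHA256.Create())
     {
         byte[] hashBytes = sha.ComputeHash(System.Text.Encoding.UTF8.GetBytes(htmlString));
         string newHash = Convert.ToBase64String(hashBytes);
         string oldHash = System.IO.File.Exists(hashFilePath)
             ? System.IO.File.ReadAllText(hashFilePath)
             : string.Empty;
         // store today's hash ready for the next comparison
         System.IO.File.WriteAllText(hashFilePath, newHash);
         return oldHash != string.Empty && oldHash != newHash;
     }
 }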

There is a range of tools in the Selenium suite (both by Selenium and by third-party developers), allowing testing to be done across most browsers.  The code below covers Internet Explorer and PhantomJS, each explained individually.

 // using the Internet Explorer driver
 public InternetExplorerDriver createBrowserDriver()
 {
     // this sample was for a website with secure pages,
     // so a login username/password was required
     // (loginUrl, loginUserName and loginPassword come from configuration)
     string LOGIN_URL = loginUrl;
     string username = loginUserName;
     string password = loginPassword;

     // get the path to the driver executable,
     // downloaded as part of the Selenium package
     string path = AppDomain.CurrentDomain.BaseDirectory.ToString();

     // open IE at the login page of the website
     var options = new InternetExplorerOptions()
     {
         InitialBrowserUrl = LOGIN_URL,
         // ignore the settings of Internet Explorer Protected Mode
         IntroduceInstabilityByIgnoringProtectedModeSettings = true
     };

     var driver = new InternetExplorerDriver(path + "drivers", options);
     return driver;
 }

The above returns an instance of InternetExplorerDriver, which creates an Internet Explorer ‘browser session’; URLs can then be passed in to be browsed.  The code below does this.

 // the website this code was written for had a login screen, with the
 // destination URL passed as a querystring and navigated to after a
 // successful login, so a simple counter tells us whether login is needed
 count = count + 1;
 if (count == 1)
 {
     // if this is the first URL in the list, log in
     // and add the URL to the querystring
     driver2.Navigate().GoToUrl(LOGIN_URL + URL);
     driver2.FindElement(By.Id("loginID")).SendKeys(username);
     driver2.FindElement(By.Id("password")).SendKeys(password);
     driver2.FindElement(By.TagName("form")).Submit();
 }
 else
 {
     // otherwise reuse the logged-in IE session to
     // navigate to the subsequent placements
     driver2.Navigate().GoToUrl(URL);
 }

 // get the page source
 htmlString = driver2.PageSource;

Using the InternetExplorerDriver means IE performs the automated browsing.  This works well and is fully supported by Selenium, but watching the browser work is not always necessary, and is a distraction if the machine is being used for something else.

PhantomJS is a headless WebKit browser; basically a browser without a front end, and PhantomJSDriver lets Selenium drive it.  This means one can do automated browser testing without a browser window popping up; perfect if the app is intended for someone who doesn’t need to see a browser in action and, more to the point, needs to use their machine for other work.

The code for PhantomJSDriver is below.

 public PhantomJSDriver createBrowserDriver()
 {
     // get the path to the directory containing the PhantomJS executable
     string path = AppDomain.CurrentDomain.BaseDirectory.ToString();

     var driver = new PhantomJSDriver(path + "drivers");
     return driver;
 }

There are no options that need passing, so this method is a lot cleaner.  The driver can be used in exactly the same way as the InternetExplorerDriver to get the page source; the only difference is that no browser window opens when scraping begins.
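
For instance (a short sketch reusing the method above, with an example URL), the scraping code looks exactly as it did with IE:

 // no browser window appears, but navigation and PageSource work as before
 var driver = createBrowserDriver();
 driver.Navigate().GoToUrl("http://www.example.com");
 string htmlString = driver.PageSource;
 driver.Quit();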

It is a good idea to get the code working with the InternetExplorerDriver first, to confirm everything behaves correctly, and then swap in the PhantomJSDriver and retest.
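
One way to make that swap a one-line change (a sketch, not from the original project) is to code against the IWebDriver interface that both drivers implement:

 // both drivers implement IWebDriver, so the scraping code never needs
 // to know which browser sits behind it (the headless flag is a
 // hypothetical parameter for this sketch)
 public IWebDriver createBrowserDriver(bool headless)
 {
     string path = AppDomain.CurrentDomain.BaseDirectory.ToString();
     return headless
         ? (IWebDriver)new PhantomJSDriver(path + "drivers")
         : new InternetExplorerDriver(path + "drivers");
 }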

The driver executables must be accessible to the project (i.e. included in the project and referenceable by path) in order for it to execute.  The executables are included in the Selenium package.

A VS2013 project is available for download from here.  This was built using VS2013 and uses ASP.NET 4.0.

Also on my GitHub.

Portable IIS Server


An example of developing some in-house C#.NET process-management tools used to monitor the HTML displayed on the front end of websites.

As there was no access to a server with external network access, there were no servers that could be used to browse the web. The tools being updated, however, required Internet access, as they were used to visit web pages, examine the HTML and check for particular HTML tags.
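
As a rough sketch of that kind of check (the URL and tag below are purely illustrative, not from the original tools):

 // download a page and confirm an expected HTML fragment is present
 using (var client = new System.Net.WebClient())
 {
     string html = client.DownloadString("http://www.example.com");
     bool tagPresent = html.Contains("<div id=\"advert-banner\""); // hypothetical tag
 }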

To get around this, a way of making IIS portable was needed, and IIS Express comes close to this.

Further information was sourced from this website by Marcin Kawalerowicz: http://goo.gl/IBt4Fx. The package can be downloaded here: http://goo.gl/fTXvfw.

The ASP.NET code can be published from Visual Studio to the ‘web’ directory and the batch file fired to execute the portable server.
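
The batch file can be as simple as pointing IIS Express at the published directory (the layout and port below are illustrative, assuming iisexpress.exe sits in an iisexpress folder next to the script):

 rem launch the portable server against the published 'web' directory
 "%~dp0iisexpress\iisexpress.exe" /path:"%~dp0web" /port:8080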

Following this, one may need to tweak the web.config file, as the above only gives a brief example of using web.config (and of creating an ASPX page with a master page).
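
For reference, a minimal web.config for an ASP.NET 4.0 site looks something like this (a generic sketch, not the package’s exact file):

 <?xml version="1.0"?>
 <configuration>
   <system.web>
     <compilation debug="false" targetFramework="4.0" />
   </system.web>
 </configuration>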

Photoshop PSD Layer Extraction Using C#


An example of exporting the layers in an Adobe Photoshop PSD file.

// import the Photoshop package - taken from the link below
using Photoshop;
 
// create a new PsdFile object
Photoshop.PsdFile psd = new Photoshop.PsdFile();
 
// load the file into the PsdFile object (filePath points at the .psd)
psd.Load(filePath);
 
// loop through all the layers in the psd
for (int j = 0; j < psd.Layers.Count; j++)
{
    // decode each layer into a System.Drawing.Image
    System.Drawing.Image myPsdImage = ImageDecoder.DecodeImage(psd.Layers[j]);
 
    // save each layer as a PNG, using the layer's name as the file name
    // (pngsOutputPath is the directory to write the PNGs to)
    myPsdImage.Save(pngsOutputPath + psd.Layers[j].Name + ".png");
}
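
One caveat the snippet above does not handle: layer names can contain characters that are invalid in file names, so it may be worth sanitising them before saving (a defensive sketch):

 // strip characters that are illegal in file names before saving
 string safeName = psd.Layers[j].Name;
 foreach (char c in System.IO.Path.GetInvalidFileNameChars())
 {
     safeName = safeName.Replace(c.ToString(), "");
 }
 myPsdImage.Save(pngsOutputPath + safeName + ".png");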

The using directive refers to a small library collection available from the link below.

http://goo.gl/9fNdCP

This library collection is part of a screenshot application; the creator is credited only by an email address. See this URL for more information about the application: https://code.google.com/p/skimpt/

Using this, as well as Adobe Photoshop’s actual DLLs (Photoshop must be installed for those), one can perform a number of Photoshop processing tasks on images. It is arguably not as effective as using Photoshop itself, but it is useful for saving layers out as PNGs.

I have added a small VS2013 website to GitHub to demonstrate.

You can download a VS2013 project for this from here: http://goo.gl/nhCucT.