Machine Readable: March 2011

Retrieving external URLs via PHP has already gotten me kicked off a shared server in the past. A handful of visitors may be enough to incur the wrath of your admin. While redesigning my personal homepage, I wanted to scrape my recent running mileage from an Endomondo widget (login required) and embed it into the site text. The simplest way to do this would be to retrieve and process the widget code from the Endomondo site on each page load.
This approach has two drawbacks: longer page load times due to the remote retrieval, and increased server load.

To solve this problem, I decided that updating my mileage once a day would be sufficient, but I still wanted the update process to be automatic and to take place on the server.

To accomplish this, I wrote the following code. The concept is simple: the scraped data is held in a file on the server. On each page load, the modification time of the file is checked; if it is more than a day old, new data is scraped and written to the file.

This way, the remote fetch and its associated slow-down and server load happen at most once a day.

The code is pretty much self-explanatory, but I'll discuss some of its finer points below.

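function howfar() {
 // Fallback text, returned when no numeric mileage is available.
 $rep = "many";
 // The @ hides the warning PHP would print if the file does not exist yet.
 if (!($mfile = @fopen('mileage.txt','r'))) {
  $latest_mod = 0;
 }
 else {
  $latest_mod = filemtime('mileage.txt');
  fclose($mfile);
 }

 // Current Unix timestamp (same value as date('U')).
 $today = time();

 if ($today-$latest_mod > 86400 ) {
  // Cache is missing or more than a day old: scrape fresh data.
  $distance = '';
  $mfile = fopen('mileage.txt','w+');
  $endo_url = "http://www.endomondo.com/embed/user/summary?id=92256&sport=0&from=20110101&measure=0&zone=Gp0200_JER";
  $endomondo = @file_get_contents($endo_url);
  if ($endomondo !== false) {
   $loc = strpos($endomondo,'Distance:');
   $loc2 = ($loc === false) ? false : strpos($endomondo,'</span>',$loc);
   if ($loc2 !== false) {
    // Offset 21 skips the 'Distance:' label and the markup that follows it.
    $distance=substr($endomondo,$loc+21,$loc2-$loc-21);
   }
  }
  // Written even when empty, so a failed fetch is not retried until tomorrow.
  fwrite($mfile,$distance);
  fclose($mfile);
 }

 else {
  // Cache is fresh: read the stored value (first 4 bytes only).
  $mfile = fopen('mileage.txt','r');
  $distance=fread($mfile,4);
  fclose($mfile);
 }
 // Anything non-numeric (empty file, failed scrape) falls back to "many".
 if (is_numeric($distance)) {
  $rep = $distance;
 }
 return $rep;
}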

First, we attempt to open the 'mileage.txt' file; the @ suppresses any error messages we don't want the user to see. If the file doesn't exist, $latest_mod is set to zero, which forces an update.
If the file is readable, $latest_mod is set to its last modification time. The current time and the modification time are then compared, and if the difference is greater than 86400 seconds (one day), new data is fetched.
Otherwise, the file is opened and the old data is read.
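
The freshness test on its own might look something like the sketch below. The function name is_stale and its parameters are placeholders I made up for illustration; they are not part of the function above.

```php
<?php
// Hypothetical helper (name and parameters are illustrative only):
// returns true when $path is missing or was last modified more than
// $max_age seconds ago.
function is_stale($path, $max_age = 86400) {
    // PHP caches stat() results within a request; clear the entry for
    // this file so filemtime() reports the current value.
    clearstatcache(true, $path);
    if (!file_exists($path)) {
        return true; // no cache file yet, so force an update
    }
    return (time() - filemtime($path)) > $max_age;
}
```

Checking file_exists() first avoids the need for the @ operator, since filemtime() is only called on a file that is known to be there.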
You could easily modify the code to fetch nearly any form of data and store it locally. The minimum "freshness" of the data is also easy to tweak: just change the 86400-second threshold.
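
As a rough illustration of that kind of modification, here is a minimal sketch of a generic version. Everything in it is an assumption on my part rather than tested production code: the name cached_fetch, the parameter names, and the convention that the $fetch callable returns a string on success or false on failure.

```php
<?php
// Hypothetical generalisation (all names are illustrative): $fetch is any
// callable that returns the fresh data as a string, or false on failure.
function cached_fetch($cache_file, $max_age, $fetch) {
    clearstatcache(true, $cache_file);
    $have_cache = is_file($cache_file);
    if ($have_cache && (time() - filemtime($cache_file)) <= $max_age) {
        return file_get_contents($cache_file); // still fresh, no remote call
    }
    $data = call_user_func($fetch);
    if ($data !== false) {
        // LOCK_EX keeps concurrent page loads from interleaving writes.
        file_put_contents($cache_file, $data, LOCK_EX);
        return $data;
    }
    if ($have_cache) {
        // Fetch failed: bump the modification time so the next attempt
        // waits another $max_age seconds, then serve the stale copy.
        touch($cache_file);
        return file_get_contents($cache_file);
    }
    return false; // nothing cached and nothing fetched
}
```

With this in place, the daily mileage lookup would amount to calling cached_fetch('mileage.txt', 86400, ...) with a function that downloads the widget and extracts the distance. Note that when there is no cache file at all and the fetch fails, this sketch tries again on the next page load.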

I hope you find this snippet useful, and I'd love to hear about it in the comments if you do.

Retrieving external URLs with PHP without overloading your server