Saturday, 21 February 2009
Twitter data mining, there's one element missing.
Twitter signups are lacking one major piece of information, and that's an age range for user accounts. It makes slicing out any form of decent information a real challenge. Facebook at least asks you for a date of birth, but extracting it is something I have never tried.
Tesco learned the hard way about not knowing age segmentation when the Clubcard was in its early days. Pensioners didn't want money-off vouchers for Coca Cola as they simply didn't drink the stuff, and they weren't quiet about it either. Localisation would work well with Twitter analysis as you can define a geocode radius for the result set.
So we can easily find out what people think about our products but never find out what age they might be.
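That geocode radius idea is simple enough to sketch: filter geotagged results down to the ones within a given distance of a centre point. A minimal Python sketch using a haversine distance check; the tweet data and coordinates here are invented for illustration:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 6371 * 2 * asin(sqrt(a))

def within_radius(tweets, centre_lat, centre_lon, radius_km):
    """Keep only results whose geotag falls inside the radius."""
    return [t for t in tweets
            if haversine_km(t["lat"], t["lon"], centre_lat, centre_lon) <= radius_km]

# Hypothetical geotagged product mentions
tweets = [
    {"text": "love it", "lat": 51.5074, "lon": -0.1278},   # London
    {"text": "hate it", "lat": 55.9533, "lon": -3.1883},   # Edinburgh
]
local = within_radius(tweets, 51.5, -0.12, 25)  # 25km around central London
```

Swap in real geotagged search results and you get localised sentiment for free; the age dimension is still the missing piece.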
Original blog post on the Sentiment Engine in the Insurance Technology Blog.
More on the SAP BOTA here.
Friday, 14 November 2008
US Currency fluctuations and running cloud instances.
Most of the cloud providers I have come across are based in the US. No bad thing, I know. My concern for a Eurozone or UK start-up is the currency fluctuations.
For example: Amazon EC2 instances are USD charged (prices calculated with Amazon's calculator).
| Instance Type | Total Hours | Annual Cost (USD) | Cost at May 2008 ($2.00/£1) | Nov 2008 ($1.486/£1) | Annual cost difference |
| --- | --- | --- | --- | --- | --- |
| Small Instance | 8760 hr | $876 | £438.00 | £589.50 | £151.50 |
| Med Instance | 8760 hr | $3504 | £1752.00 | £2358.00 | £606.00 |
| Large Instance | 8760 hr | $7008 | £3504.00 | £4716.01 | £1212.01 |
Obviously currencies fluctuate daily, but it is an eye opener for a UK start-up to see how much the costs incurred on a larger instance will be. This doesn't take into account anything like S3 storage or SQS charges. Even if you moved S3 storage to a euro-based account you have transfer costs to take into account.
So, the likes of GoGrid, Aptana, Flexiscale and Amazon all charge in USD, which makes the case for Cloud in the UK a cost problem if you don't plan your finances properly in the start-up phases.
I know the costs are purely theoretical as the $2.00/£1 exchange rate was six months ago and it has steadily slid during the course of the year. If that trend continues and a $1/£1 exchange rate happened (stranger things have), the costs in the Eurozone and the UK will steadily increase.
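The table's figures fall out of a simple conversion. A quick Python sketch; the hourly USD prices below ($0.10/$0.40/$0.80) are assumed purely to reproduce the annual figures above:

```python
HOURS_PER_YEAR = 8760

# Assumed hourly on-demand prices in USD (chosen to match the table above)
instances = {"Small": 0.10, "Med": 0.40, "Large": 0.80}

def annual_gbp(hourly_usd, usd_per_gbp):
    """Annual cost in GBP for a continuously running instance."""
    return hourly_usd * HOURS_PER_YEAR / usd_per_gbp

for name, price in instances.items():
    may = annual_gbp(price, 2.00)    # May 2008 rate: $2.00/£1
    nov = annual_gbp(price, 1.486)   # Nov 2008 rate: $1.486/£1
    print(f"{name}: £{may:,.2f} -> £{nov:,.2f} (difference £{nov - may:,.2f})")
```

The USD bill never changes; the sterling cost of the same instance grows purely because the exchange rate moved.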
Take a UK company, for example: a single instance, in its most basic form, costs the following:
| Instance Type | Monthly (USD) | May | June | July | August | September | October |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Small Instance | $73.20 | £36.99 | £36.75 | £36.95 | £40.21 | £41.21 | £45.35 |
| Med Instance | $292.80 | £147.96 | £147.00 | £147.78 | £160.85 | £164.85 | £181.39 |
| Large Instance | $585.60 | £295.92 | £293.99 | £295.56 | £321.71 | £329.69 | £362.78 |
Learning Python and going with Google's AppEngine will start to appeal to small-scale companies purely for the cost savings.
Friday, 7 November 2008
Loyalty Card Data - Perhaps the true cloud battleground.
Anything with large datasets has the potential to be a cloud goldmine (if it's not already), including web usage data, video, photography and music. The main problem with loyalty card data is getting your hands on it... not the easiest thing in the world to do. At the end of the day it's a lot of numbers: date/time, store location, item, offer code... Tesco go deeper, creating more data and a better idea of how to target their customers.
In the UK there is one clear market leader who knows what they are doing with the data, and that's Tesco. That's largely down to Dunnhumby, who put a lot of back-end consultancy together. Tesco mines the best part of 18 million customers' data. From my dealings with Rippledown, we found out on the grapevine that the reason you only got vouchers every three months was that that's how long it took to mine the data.
This left the rivals as complete non-starters. The Nectar card scheme never really did anything, and on top of that Asda pretty much gave up on loyalty cards altogether. Nectar in the end got Peter Gleason to move over from Dunnhumby; he was one of the team who knew data analysis. Not a programmer that I'm aware of, just a smart marketer.
Consumer data is big money to the supermarkets as they can sell it on, and the faster they can process it, the quicker they can monetise it. Sounds like a perfect elastic platform waiting to happen. There's a good chance it's already happening. One company that stopped selling their data, though, was Wal-Mart; selling competitor information about suppliers is never a good thing. Wal-Mart operate their own data centre with over 460 terabytes of data storage, and boy do they know how to do stuff. CIO Linda Dillman could use the data to predict buying patterns ahead of hurricanes (Pop-Tarts sales increase seven-fold and the pre-hurricane favourite was beer); predicting is much better than analysing after the fact.
What doesn't come out in the wash is the speed at which this data can be collated and processed. The Tesco problem was always the sheer volume to process over a three-month period; I'm not sure what Wal-Mart's latency would be in all of that.
Instance replication is the big plus point here: getting servers to replicate during high loads of data processing. The processing of the New York Times archives (now a Cloud Computing must-read) gives a sense of what's possible with instance replication. Sainsbury's have gone to a system of offers at the point of sale, i.e. looking at the customer's basket and creating offers from there. It doesn't go into any historical depth. A web service to a cloud backend (with all the previous data run during the night) could recall new offers based on what they've already bought. The key here is the customer: it's easy to knock out a voucher to reduce petrol/gas by 5p a litre, but a whole other thing to push a new type of butter on the customer who's spent the last month buying margarine.
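As a rough sketch of what that history-driven offer matching could look like once the overnight run has done its work (all customer IDs, products, categories and offers below are invented):

```python
from collections import Counter

# Hypothetical purchase history: (customer, product, category)
history = [
    ("c1", "olive spread", "margarine"),
    ("c1", "olive spread", "margarine"),
    ("c1", "semi-skimmed", "milk"),
]

# Hypothetical offers, keyed by the category they aim to switch a customer from
offers = {"margarine": "50p off premium butter"}

def offers_for(customer):
    """Pick offers based on the categories a customer actually buys,
    most-bought categories first."""
    cats = Counter(cat for cust, _, cat in history if cust == customer)
    return [offers[c] for c, _ in cats.most_common() if c in offers]

print(offers_for("c1"))  # the margarine habit drives the butter offer
```

The point-of-sale web service would call something like `offers_for` against the pre-computed history, rather than inspecting only the current basket.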
If there's one arena to keep an eye on it's this one.
Thursday, 6 November 2008
So what derailed Digital Railroad and could the clouds part?

What this does raise is an interesting question of how this could have been avoided. Firstly, if a company runs out of cash then it runs out of cash; it doesn't matter if there's a Cloud infrastructure in place or not, if the money's not there then it's not there. DRR's press release heavily hints that the running of the servers was a big part of the cost. From an accounting standpoint, where does $15 million go? Well, there were sixty staff, no doubt rent to pay, and that's before we get to the servers and electricity.
Could someone have turned the dial back and started looking at the alternatives a little earlier? Something scalable like Amazon S3 would have been a perfect fit and would have saved on the overhead of fixed server storage. Keeping in mind that the full-size pictures from my Fuji S5 Pro come out at a good 7-8MB each, you can start to understand the scale that DRR were working with. Some pro photographers were keeping 5000+ images on their servers.
Looking at the maths, a per-photographer storage cloud seems a good idea. 8MB x 5000 is just over 39GB. With a couple of gig of transfer in and out you could do the lot for less than $10 a month. Seems to me that the likes of Smugmug and SendPortfolio got their pricing right.
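The back-of-envelope maths, spelled out in Python; the per-GB rates below are assumed circa-2008 S3-style figures ($0.15/GB-month storage, $0.17/GB transfer), not quoted prices:

```python
# Assumed circa-2008 S3-style pricing, for illustration only
STORAGE_USD_PER_GB = 0.15
TRANSFER_USD_PER_GB = 0.17

images = 5000
size_gb = images * 8 / 1024  # 8MB per image -> just over 39GB

# Monthly storage plus roughly 2GB of transfer in and out
monthly = size_gb * STORAGE_USD_PER_GB + 2 * TRANSFER_USD_PER_GB
print(f"{size_gb:.1f}GB stored, roughly ${monthly:.2f}/month")
```

Even with the rates nudged around, a working pro's whole archive comes in comfortably under $10 a month.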
The idea of a company like DRR or Alamy for that matter looking after my interests never appealed to me. I have always been in charge of my images and where I put them, plus the images I sell tends to be through private channels. What I'd like to see is a storage solution where I can put my images (S3) and then let the channels of my choosing (Alamy, Photoshelter, Getty and the rest) then point to my storage server. First of all it cuts down the huge server demand and storage on the suppliers and makes life a lot easier for the photographer. Metadata can be stored and, more importantly, queried. If it's been done then I'd love to know.
The first priority now is to make sure that photographers who are owed royalties from image sales are remunerated properly. As it stands, that looks like it won't happen.
Friday, 3 October 2008
Open Bluedragon perhaps the better way to develop for me.
For me personally, JSP/J2EE was never a rapid application development platform. PHP is common, in fact I'm just sorting a project out now for a client, but it's still a mess. CFML I've liked for a while, but it's always been a case of the cost of entry: you needed a licence for the server, and when the $$$ are coming directly from my pocket I slam the brakes on until I know the business is going to absorb the start-up costs.
So, on my return to the arena I find out that BlueDragon has gone open. Now CFML looks like a viable solution. I downloaded the version with Jetty as the server: worked first time, no complaints, well documented to get started. A quick peek under the cover and I could work out how to sort out data sources. The OpenBD team have done a brilliant job.
In terms of rapid application development, I'm getting on quicker than I would with Ruby on Rails. With RoR I always had to go back and change the forms to how I wanted them. RoR was great for really quick scaffolds, but you can spend just as much time replumbing things to how you want them.
I know everyone has their own style and preferences, but the CFML/Java combo looks good for the things I want to do. It's nice to be back.