Search engines still in breach of EU data protection laws

Posted by scott on May 27th, 2010

The EU’s Article 29 Data Protection Working Party has sent public letters to the three major search engines – Google, Microsoft and Yahoo! – saying that although it welcomes their efforts to bring their data retention policies in line with the law, they are all still in breach of the EU’s data protection directive.

The Working Party tells Google that it should reduce the period after which it “anonymizes” IP addresses in its server logs to 6 months, rather than the 9 months Google had agreed to. It also states that Google’s method of anonymisation, deleting the last octet of each IP address, is not adequate.

According to the Working Party, ‘such a partial deletion does not prevent identifiability of data subjects.’ In addition, it was not happy with Google’s cookie retention practices, under which Google retains cookies for 18 months. ‘This would allow for the correlation of individual search queries for a considerable length of time. It also appears to allow for easy retrieval of IP-addresses, every time a user makes a new query within those 18 months,’ the Working Party letter states.
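To see why dropping the last octet is such weak protection, here is a minimal sketch in Python. It assumes IPv4 addresses and uses only the standard library; the letters describe Google's method only as deleting the last octet, so the function names, the sample addresses and the cookie value below are my own illustrative assumptions, not anything taken from Google's systems.

```python
import ipaddress


def truncate_last_octet(ip):
    """Zero the final octet of an IPv4 address (192.0.2.57 -> 192.0.2.0)."""
    network = ipaddress.ip_network(f"{ip}/24", strict=False)
    return str(network.network_address)


def candidate_addresses(truncated):
    """List every address that truncates to the same value."""
    return [str(addr) for addr in ipaddress.ip_network(f"{truncated}/24")]


def recover_ip(log, cookie):
    """Return a full IP logged against the same cookie, if one exists."""
    for entry in log:
        if entry["cookie"] == cookie and not entry["ip"].endswith(".0"):
            return entry["ip"]
    return None


truncated = truncate_last_octet("192.0.2.57")
candidates = candidate_addresses(truncated)

# Hypothetical log: one "anonymised" entry and one fresh entry sharing a cookie.
log = [
    {"cookie": "abc123", "ip": truncated, "query": "old search"},
    {"cookie": "abc123", "ip": "192.0.2.57", "query": "new search"},
]

print(truncated, len(candidates), recover_ip(log, "abc123"))
```

The truncated address still narrows the user down to one of 256 candidates, and as long as the cookie outlives the anonymisation step, a single new query is enough to tie the old entries back to a full address.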

The Working Party is a bit more gentle on Microsoft and applauded its willingness to reduce the retention period of cookies and IP addresses to 6 months, depending on the willingness of other search engines to follow suit. However, like Google, Microsoft retains cookie data for 18 months, which again still left room for ‘the cross-matching of search queries for a considerable length of time.’ The Working Party also questioned the effectiveness of Microsoft’s anonymisation claims.

Yahoo! had pledged to reduce its retention time to 90 days, with limited exceptions for fraud, security, and legal obligations. This pleased the Working Party, which welcomed the move to deleting the full IP address from the first full dataset after 90 days instead of just deleting the last octet. But again there were concerns: ‘a partial deletion of the personal data contained in search logs does not constitute true anonymisation,’ the Working Party points out. Also, as with Google and Microsoft, it says it was not provided with enough information to technically assess the quality of the anonymisation policy.

Here there was a clear call to all three search engines to review their anonymisation claims and make the process verifiable, preferably by developing a credible audit process involving an external and independent auditing entity. ‘The actual techniques of anonymisation deserve an open debate, open to public scrutiny, in light of the expanding body of research on the failures of anonymisation,’ states the Working Party.

The Working Party also recognises the transatlantic nature of the issue and states that it has forwarded its concerns to the Federal Trade Commission (FTC), and asked the FTC to use its authority to examine the compatibility of this behaviour with section 5 of the Federal Trade Commission Act.

It’s good how long you can get away with breaking the law, isn’t it? Look forward to the responses from Google and Co.

Bringing Taxonomy and Folksonomy Together

Posted by scott on January 29th, 2010

There is a nice straightforward article by Thomas Vander Wal on ZDNet blogs about taxonomies. In it he explains why - as many of us know - within an organisation it should not be a choice between traditional taxonomies OR social tagging/folksonomies, but an AND choice, with the latter informing and feeding into the former.

The fact that a Taxonomy should be a living thing is one I have seen overlooked in companies before.

Vander Wal uses a table to show how, by combining them, the strengths of each account for the other’s deficiencies.

Good article. Well worth a read.

Social networking data taxonomy

Posted by scott on November 26th, 2009

Just read an interesting post by Bruce Schneier on his idea for a taxonomy of social networking data, which he sets out as follows:

1. Service data. Service data is the data you need to give to a social networking site in order to use it. It might include your legal name, your age, and your credit card number.
2. Disclosed data. This is what you post on your own pages: blog entries, photographs, messages, comments, and so on.
3. Entrusted data. This is what you post on other people’s pages. It’s basically the same stuff as disclosed data, but the difference is that you don’t have control over the data — someone else does.
4. Incidental data. Incidental data is data the other people post about you. Again, it’s basically the same stuff as disclosed data, but the difference is that 1) you don’t have control over it, and 2) you didn’t create it in the first place.
5. Behavioural data. This is data that the site collects about your habits by recording what you do and who you do it with.

Interesting discussion in the comments (so yes, you should go and actually read the original post!) about whether Bruce’s taxonomy works - for example, whether 2-4 are all Disclosed data subsets - 2. Disclosed data (controlled), 3. Disclosed data (entrusted), 4. Disclosed data (incidental); whether there should be an additional facet covering Crosslinked Data; and whether the term ‘Incidental Data’ is obvious as a label or should be called something else.

Breaking news at Bing … no one cares

Posted by scott on November 23rd, 2009

TechCrunch and others seem to be running with the story of Rupert Murdoch’s News Corp looking to maybe cut a deal with Microsoft to allow it to carry links to its news content whilst cutting off Google.

I’m sure the folks at Google are thinking – so what?

If Microsoft and Murdoch think that this deal would suddenly drive lots of people away from Google and over to Bing I think they are showing yet another spectacular lack of understanding of how things work.

Most people use Yahoo News / Google News / MSN News etc to find out ‘what is going on now’ - they are after breaking news. They are not – as a rule – looking for ‘what’s breaking at The Sun / The Times’ etc. If they are interested in that then they’ll either go directly to that paper’s website or they’ll already have some sort of RSS feed set up from that publication to alert them to any new content.

Microsoft/Bing would need exclusive deals with a LOT of publishers for this to have any effect whatsoever. For the ROI, I’m not convinced that would be money well spent.

Taxonomies - reanimated or dying?

Posted by scott on April 22nd, 2009

The current issue of Online has an article by Seth Earley entitled ‘The New Versus Old Schools of Taxonomies, Metadata, and Information Architecture’, which takes as its starting point a CMS Watch prediction that, with the rise of social computing, ‘Taxonomies are dead. Long live metadata’.

Earley is having none of this and argues that far from being dead, taxonomies are more important than ever.

He suggests that this line of thinking may stem from the old view of taxonomies, where taxonomy = navigation and nothing more. In today’s world, with methods such as faceted search, the worlds of navigation and search merge behind the scenes. “[B]uilding out faceted search and filtering based on attributes requires a well constructed taxonomy. The controlled vocabularies of their taxonomy drive choices for the facets, and hierarchies allow users to refine their choices intuitively.”
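As a rough illustration of that point, here is a minimal Python sketch of faceted filtering driven by a small taxonomy. Everything in it (the terms, the sample documents and the function names) is made up for the example and does not come from Earley's article; it just shows a controlled vocabulary supplying the facet values and a hierarchy letting a broad choice match narrower terms.

```python
from collections import Counter

# Hypothetical taxonomy: each term maps to its parent (None = top level).
PARENT = {
    "Privacy": None,
    "Corporate": None,
    "Cookies": "Privacy",
    "Data retention": "Privacy",
}

# Hypothetical documents tagged with terms from the controlled vocabulary.
DOCUMENTS = [
    {"title": "Merger guidance", "topic": "Corporate", "region": "UK"},
    {"title": "Search log memo", "topic": "Data retention", "region": "EU"},
    {"title": "Cookie policy review", "topic": "Cookies", "region": "UK"},
]


def is_within(term, ancestor):
    """True if term is the ancestor itself or sits anywhere beneath it."""
    while term is not None:
        if term == ancestor:
            return True
        term = PARENT.get(term)
    return False


def refine(docs, topic=None, region=None):
    """Keep only the documents matching every facet that was selected."""
    results = []
    for doc in docs:
        if topic is not None and not is_within(doc["topic"], topic):
            continue
        if region is not None and doc["region"] != region:
            continue
        results.append(doc)
    return results


def facet_counts(docs, facet):
    """Count the values of one facet across the current result set."""
    return Counter(doc[facet] for doc in docs)


privacy_docs = refine(DOCUMENTS, topic="Privacy")
print([doc["title"] for doc in privacy_docs])
print(facet_counts(privacy_docs, "region"))
```

Choosing the broad term "Privacy" picks up documents tagged with its narrower terms, and the counts on the remaining facet tell the user how each further refinement would shrink the results.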

He makes a convincing case for having multiple taxonomies too - something I have had many pro and con discussions about over the years. According to Earley, multiple taxonomies are needed to “represent various perspectives within the organisation.” On its own this might be a statement you’d question, but he goes on to make a very important point.

“Do not confuse this with having multiple navigational hierarchies. A navigational hierarchy is an access structure. It is dependent upon the context being indexed. However, building multiple taxonomies and surfacing them within new tools will allow for maximum flexibility in navigational constructs.”

He concludes by saying that to survive taxonomies need to evolve and change with the needs of the organisation and also with the changes in technology, but that they still have an important role to play in how we access information and mine internal knowledge effectively. Worth a read.

Newssift - promising new business news search engine

Posted by scott on March 19th, 2009

The FT has launched a new business news search engine called Newssift. Newssift indexes about 4,000 business news sources, from online newspapers and blogs to news portals and research sites. It is ingesting about 120,000 articles a day right now and applying semantic tags to each one. In the end it can categorize each article by business topic, organization, place, person, and theme.

You can enter a search term, or choose from news topics - random topics currently in the news in the Business, Organisation, Place, Person, and Theme categories. If you put in a search term, all these topic boxes are then populated with any matching terms. For example, I put ‘Clifford Chance’ in the search box and got hits on Organisation and Person.

Clicking on the Organisation link ran a search returning 25 articles. The search can now be further refined by adding additional terms yourself or by selecting additional terms that have appeared in the original 5 category boxes.

The search results also provide two pie charts - one ranking whether the articles are positive, negative or neutral and one detailing where the articles come from.

This has a lot of promise. The faceted search approach is a good one. Whilst it is by no means perfect out of the box, it is certainly worth keeping an eye on and adding to your list of news search tools.

Update: It has been pointed out to me that the T&Cs for the service have been written by someone for whom web2.0 hasn’t happened - or more likely they simply copied and reworked old FT.com terms from around 5 years ago that also used - like many news sites at the time - to have daft ‘linking’ permission rules. Someone remind me how the web works again?

You may be granted a limited, nonexclusive right to create a hyperlink to Newssift.com Web provided (i) you give FT Search Inc. notice of such link by writing to privacyofficer@newssift.com, (ii) FT Search Inc. confirms in writing that you may establish the link, (iii) you do not remove or obscure the copyright notice or other notices on Newssift.com Web, (iv) such link does not portray Newssift.com Web or any of its products, software, content or services in a false, misleading, derogatory or otherwise defamatory manner, and (v) you immediately discontinue providing a link to Newssift.com Web if so requested by FT Search Inc. You may not use an Newssift.com logo or other proprietary graphic or trademark of Newssift.com to link to the Newssift.com Web without the express written permission of FT Search Inc.
…Except as expressly approved by FT Search Inc. in writing, you agree not to reproduce, duplicate, copy, sell, trade, resell or exploit for any commercial purposes, any portion, or use of, or access to, Newssift.com Web.

All this wouldn’t be as funny, were it not for the fact that the site is a news aggregating service. I hope Newssift removes this backward and pointless clause from their site asap. Oh, and whilst they’re at it: Saved Searches - great, but how about an email/RSS option rather than having to revisit the site each time, which, frankly, I’m not going to do.

Google launches another blog - do we care?

Posted by scott on February 10th, 2009

I notice that Google has launched a Social Web blog, which will provide ‘news and updates about Google products that are helping to make the web more social.’  Now this sounds all well and good and might turn out to be a useful and informative resource, or it might just turn out to be a repeat of the Google Librarian Central blog. Remember that one?

Last summer Google pulled the plug on this half-hearted attempt to reach out to the librarian/information professional sector, so that it could focus its energies on the connected quarterly newsletter - Google Librarian Newsletter. This dazzling level of focus resulted in the issue published at the time they closed the blog to focus their energies (10 July 2008) being the last one actually published.

Seven months have now passed, so they have either mislaid that focus or just have no news to report.

The Social Web blog is off to a good start with me, as it does not display properly in IE6 (at least not my work version): I need to scroll down half a blank page to get to the first post.

Find Any Film

Posted by scott on January 28th, 2009

The UK Film Council has just launched a film search engine - findanyfilm.com - allowing UK film buffs to locate films legally available in the UK (Note: It does not class DVDs encoded for regions outside of ‘region 2’ as being legal).

According to the site: “our aim to give you the best possible movie experience, promoting every type of film that’s available for viewing in the UK. To do this, we’re working with the UK Film Council, the Industry Trust, and exhibitors, distributors and retailers, to promote legally available films - making high-quality, legal viewing one of our top priorities. We’re also committed to becoming the UK’s most comprehensive film-watching search engine, with an ever-increasing number of titles and a wide choice of reputable retailers. We list films across every format, and if a film is not immediately available, we’ll do our best to find it for you and let you know the details as soon as possible.”

A search for the Louise Brooks classic, Pandora’s Box, returned 6 results - it seems to be a free text search, unless you go into advanced search where you can search by Title, Actor, Director, Year, Genre, Certificate, Language, or Keyword.

Not a brilliant start. Why is the 2001 film “Trois 2 - Pandora’s Box” the number 1 result, and “Pandora’s Box” number 2? How can an exact title match be the second hit??
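I have no idea how findanyfilm.com actually ranks its results, but the behaviour I expected is easy enough to sketch. The Python below is a hypothetical example of tiered title ranking (the function name and the tiers are my own invention): exact matches first, then titles that start with the query, then titles that merely contain it.

```python
def rank_titles(query, titles):
    """Order titles so that exact matches come before partial matches."""
    wanted = query.casefold().strip()

    def tier(title):
        candidate = title.casefold().strip()
        if candidate == wanted:
            return 0  # exact title match
        if candidate.startswith(wanted):
            return 1  # title begins with the query
        if wanted in candidate:
            return 2  # query appears somewhere in the title
        return 3      # no textual match at all

    # sorted() is stable, so titles in the same tier keep their original order.
    return sorted(titles, key=tier)


results = rank_titles(
    "Pandora's Box",
    ["Trois 2 - Pandora's Box", "Pandora's Box"],
)
print(results)
```

With a rule like this the exact title comes out on top, and whatever relevance score the engine already uses can still decide the order within each tier.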

Clicking on Pandora’s Box brings up a new page, providing film viewing options: In the Cinema; On TV; On DVD; On Blu-ray; To Download; To Watch Online; In Other Formats.

Where, for example, it is available on DVD, it will take you to a page listing a number of online sites where the film can be purchased, with prices also displayed.

If the film is not available in your preferred viewing format, the service lets you sign up for alerts on films, so you can be informed when they are coming to a cinema near you, appearing on TV, coming out on DVD/Blu-ray, or being made available via downloads or online. This is really good, although I would have hoped that this information could have been delivered by an RSS feed as well as via individual emails - which does not seem to be the case. Also, there does not seem to be an option to have a ‘page’ to visit to display/check just what films you are keeping an eye out for. The site could do with taking a look at Locate TV to pick up some pointers on that point. It does something similar for TV and DVD, and can let you create your own ‘watch list’ page - [Although Lottie, I’d STILL like an RSS feed option there too! Not going to stop asking :-)]

The alerts page also provides a list of ‘related titles’. From here, for example, I can see that another fine Louise Brooks film - Diary of a Lost Girl - is on at The Barbican at 4pm on 22 February 2009.

I think this could turn into an excellent resource for all film buffs. Worth a look.

Rexyo - the non-search search engine

Posted by scott on November 20th, 2008

Just been trying out a new search engine called Rexyo.

It’s brilliant - sadly, not in a good way.

According to Rexyo: “Rexyo is a versatile, highly scalable search engine currently available as an online Beta version. It is considered different from other search engines because it can search visually within images and video but also in audio and other types of data.”

You got the look

The UI is ugly. They seem to have taken Exalead’s idea of allowing you to save some favourites to the page, but not bothered about how it would look. I do like the tags (recently searched for words) box, or at least the idea of it, as the execution of it is bad.

Searching

As is my way with all search engines, I started with the easy task of doing searches for: Informationoverlord, scott vine, and ioverlord.

Results

Informationoverlord = 0
scott vine = 0 [But helpfully its Spelling Manager came up and said: wrongly spelled: scott suggestion: scout vine ]
iOverlord = 0

Ok, maybe it is not meant to tackle such hard tasks, so I decided to take another approach and search for: eMusic (as I spoke about that in my last post); Ryan Adams (as I am going to see him in concert this evening) and Google (to see if Rexyo could recognise a search engine).

Results

eMusic = 612 [ although Spelling Manager popped up again. wrongly spelled: eMusic suggestion: music ] Reasonable number of hits on this one. One slight problem though: apparently it is incapable of finding eMusic’s website. I went through the first 10 pages of hits and it wasn’t anywhere to be seen.
Ryan Adams = 869 [ Spelling manager: wrongly spelled: Ryan suggestion: ran Adams / wrongly spelled: Adams suggestion: Ryan dams ] Only had to go to page 4 of the results to find ryan-adams.com. A success in Rexyo terms.
Google = 957 [ Spelling manager. wrongly spelled: Google suggestion: cackle ] Once again, I had to give up after 10 pages of results, without any sight of google.com

One final note, a search for Rexyo on Rexyo doesn’t find rexyo.com.

Maybe I have got it all wrong and it can ONLY search within images, audio and video? But even then, this is quite possibly the worst search tool I have ever used. Dreadful on a quite spectacular scale.

Images + Hyperlinking = patent infringement?

Posted by scott on May 29th, 2008

It has been a while since a really good internet patent case came along - remember BT wanting to claim they invented hyperlinking (the Court didn’t agree) - well, now a small Singapore-based company called VueStar Technologies has started sending out invoices to websites, informing them that they need a licence if they use pictures or graphics to link to other web pages.

Their claims are based on a patent granted in 2006 (applied for in 2001) for a system that “provides a web-page (or web-site) search results list which includes images from the actual web-pages or web-sites identified in a user’s search, or images associated with the actual organization operating a web-site. This assists a user to locate web-pages of interest or relevance to the user by providing images to assess the relevance of web-pages identified in a search, prior to the user having to hyperlink to the actual web-page itself.”

So, search engine image searches, such as Google Image Search, will sooner or later be getting their invoices in the post. Indeed, depending on your reading of the patent, a high percentage of all websites (including this one) will be infringing on the patent in some form. In the FAQs on the company’s website they claim that clicking, scrolling or streaming over a visual image to connect with a website or Web page is an infringement, and invite companies “to view and make their own assessment and then make a democratic choice. To use visual images or revert to text only options.”

At the moment the company is targeting sites based in Singapore, but it intends to expand this to Australia, New Zealand, and the US in the coming months.

It seems likely that, at some point, the courts will be called upon to rule whether or not VueStar can enforce the patent and, if so, how wide an interpretation will be allowed. VueStar are apparently confident they would win any court case - so were BT, but I’d be very surprised if this didn’t go the same way as the BT case, with the courts finding the scope too wide and much of it unenforceable. Still, one to keep an eye on in the coming months.


Copyright © 2007 Informationoverlord. All rights reserved.