Cluster-based Recommendation with Mahout

Mahout includes a few new experimental recommenders that are poorly documented at the moment. One of them is TreeClusteringRecommender, which clusters your model into a set of groups and makes recommendations based on distances between your users and the items in these clusters.

A cluster-based recommender may be a good choice if your data is sparse and too weakly correlated to find obvious patterns. Another advantage is that clustering may help you provide recommendations to users even when very little data is available. Yet it decreases the level of personalization: the output is not unique to users but to clusters.

Here is a quick start to create and run one:

DataModel model = new FileDataModel(new File("ratings.csv")); // userID,itemID,value
UserSimilarity similarity = new LogLikelihoodSimilarity(model);
ClusterSimilarity clusterSimilarity = new FarthestNeighborClusterSimilarity(similarity);
TreeClusteringRecommender rec = new TreeClusteringRecommender(model, clusterSimilarity, 20);

rec.getCluster(1); // gets the cluster of userId=1
rec.recommend(1, 10); // recommends 10 items to userId=1

What about user similarity and cluster similarity?

  • You basically have to provide a similarity function to be able to measure distances between different users. You may want to represent each user with a vector and calculate Euclidean, Pearson, cosine, etc. distances over these features. Or LogLikelihoodSimilarity may work as well. You may want to look at Surprise and Coincidence to understand what's under the hood of this similarity.
  • Cluster similarity is the new concept here. It's the place to customize how the similarity between two clusters is measured. There are already two implementations: NearestNeighborClusterSimilarity and FarthestNeighborClusterSimilarity. Beware that clusters are dynamic: as new data comes in, old clusters may be merged and new ones may be introduced.

And the rest is experimental work to find a method that fits your data by analyzing the nature of your input, plotting the results and evaluating. The initial clustering takes a lot of time compared to Mahout's item or user similarity based recommenders, yet it performs OK online once you start working with pre-computed data. Even though it's not very mature, for a primitive start you may still want to consider this recommender if what you want to achieve fits clustering.

The Rise of Open Science

Open science is changing the way we do science. It stands for transparency and public accessibility of scientific data, collaboration, methods and results. It also supports public contribution to the current state of science, and giving science back to the public domain.

Motivation

While doing science, we rely on older publications and methods that are often published with no open access to data. Years ago, the academic community skeptically started to question the credibility of the research work in the existing literature. The way science is funded was one of the chief reasons behind this question. Science made with non-open data could easily be led by politics or other funding authorities, such as private companies, to misrepresent facts such as global warming or the medical side effects of a new drug. Firing up a discussion on openness led to other ideas, such as opening up the methods and the scientific source code.

Why to open data, open tools and open results?

One of the core values of science was being open and accessible. But ironically, science today receives heavy financial support from private institutions and governments, where much of the budget is shaped by economic, industrial or military needs. Scientific institutions are mostly closed to people without PhDs for scientist roles, because there is already huge competition among PhDs. Our credibility is measured by the number of papers we publish and the number of citations we receive. I don't want to slander scientists, but professional science, in its own closed ecosystem, has a few conflicts with the key foundations of science. Science's route, subjects, people and results are controlled, or may possibly be controlled, by authority. In the next few decades, we have to rethink the way we sustain science.

We also have a verification problem with science that relies on data. Computational and statistical science is lacking when it comes to reproducing the final results advertised in publications. JASA (Journal of the American Statistical Association) reports that only 21% of its papers were published with open source code in 2011, still a positive number compared to 2006's 9%. Without code or data, even if the work is published in an academic journal, there is no way to validate or iterate over the existing findings.

One of the key problems we can point to is that scientific research is not maintainable without economic sustainability, due to the need for scientific tools. I watched Eri Gentry, the founder of BioCurious, at OSCON last year. Her key points about opening up scientific tools, in the self-makers' vision, were motivating. According to her, at some point BioCurious needed a PCR machine that cost several thousand dollars to keep their garage-based research going. Since they couldn't afford the machine, they decided to analyze how it actually works. Fortunately, they figured it out and created OpenPCR. And now you are able to copy some strawberry DNA sequence or do cancer research at home. An open repository of knowledge on making scientific tools will increase the level of collaboration from regular makers and DIY people who may never have the chance to investigate or reverse engineer these tools.

Collaborative Science

With the radical changes in the means of communication, discovery and discussion will have to change radically as well. A few months ago, I saw a book by Michael Nielsen called Reinventing Discovery: The New Era of Networked Science in the new arrivals section. Nielsen opens the first chapter with a 2009 story about Tim Gowers' Polymath Project. Tim Gowers is a very notable mathematician, a Fields medalist from Cambridge University. In 2009, instead of working alone or with his existing peers, he decided to discuss a mathematical problem on his blog and asked readers to share their ideas online. In 6 weeks, he received 800 comments from 27 people. Although the start had its pitfalls, 37 days later Gowers announced they had solved not just his problem but a generalization of the Polymath problem, including a special case.

And what about citizen science? Citizen science used to be perceived as a more professional way of scientific crowdsourcing, but this perception seems to be changing. Very recently, I had a few discussions with friends who are total strangers to citizen science and its current initiatives. They primarily questioned the need for citizen scientists. Our main talk was about the classification of galaxies on GalaxyZoo. GalaxyZoo is an online tool that shows you images of galaxies taken by the Hubble telescope and asks you to manually choose whether a galaxy is elliptical or spiral, and whether or not it has some set of features. Any programmer would initially ask why we are doing this classification manually in the 2010s. Honestly, we have the technology to pick up the features directly from the signal without any observation by a human eye. So? But discovery is not classification. We actually don't know what we are looking at. Any anomaly or strange-looking object could be a new scientific discovery. By reviewing the existing images, GalaxyZoo members discovered a new type of galaxy, now called "pea galaxies", and Hanny van Arkel, a Dutch schoolteacher, discovered a strange green nebula-looking object the size of the Milky Way, called Hanny's Voorwerp, back in 2007.

So, why aren't we taking it further? There is an ongoing effort to make a cultural shift to increase awareness of and participation in science. Not only the Zooniverse projects: NASA opened code.nasa.gov very recently. Ariel Waldman has been keeping a directory of citizen space exploration projects on spacehack.org for a while. LHC's ongoing CMS project donated data to Science Hack Day participants to let data hackers come up with data visualization tools for CMS. DIYgenomics is crowdsourcing genomic data. The list goes on...

Conclusions

With the ongoing momentum in scientific communities, in the next few decades we'll experience a tremendous change in the way we make and participate in science. For now, not displacing conventional means but creating new possibilities, a new science is approaching, with a strong sympathy for making scientific results freely and universally accessible.

Android's RTP implementation

Although it's still not really mature, Android has supported RTSP streaming for a long time. In theory, it's very trivial to play an RTSP link with a MediaPlayer.

MediaPlayer player = new MediaPlayer();
player.setDataSource("rtsp://...");
player.prepare();
player.start();

But in practice, the MediaPlayer implementation is not fair enough to give you feedback, and you basically don't know what's going on when your media is not playing. I will mostly be talking about the network layer, so you will have a basic idea of how to configure your media servers.
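As a side note, you can at least surface MediaPlayer's raw error codes by registering listeners before preparing the stream. Below is a minimal sketch of that idea; the URL, log tag and class name are placeholders, and it won't fix the network-level problems discussed next, it only makes the failures visible.

import java.io.IOException;

import android.media.MediaPlayer;
import android.util.Log;

// Squeeze at least some feedback out of MediaPlayer while debugging a stream
// that silently refuses to play.
public class RtspDebugPlayer {
    public static MediaPlayer play(String url) throws IOException {
        MediaPlayer player = new MediaPlayer();

        player.setOnErrorListener(new MediaPlayer.OnErrorListener() {
            public boolean onError(MediaPlayer mp, int what, int extra) {
                // "what" and "extra" are raw error codes; cryptic, but better than silence.
                Log.e("rtsp", "MediaPlayer error, what=" + what + " extra=" + extra);
                return true; // true means we handled the error ourselves
            }
        });

        player.setOnPreparedListener(new MediaPlayer.OnPreparedListener() {
            public void onPrepared(MediaPlayer mp) {
                mp.start(); // start only after the stream has been negotiated
            }
        });

        player.setDataSource(url);
        player.prepareAsync(); // don't block the UI thread on the RTSP handshake
        return player;
    }
}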

RTSP and RTP

Generally we call the whole thing RTSP. But RTSP streaming has two phases: RTSP and, mostly, RTP to transport the actual media data. RTSP is a stateful protocol. While making the first connection, client and server agree on a bunch of details and exchange data about the media being served. This is done with a family of directives, sent on TCP 554. The RTSP flow includes OPTIONS, DESCRIBE, SETUP, PLAY/PAUSE/etc. In the request made for the SETUP directive, the client specifies which transport protocol it'll support (in this case, RTP), and on which protocol and which port. Android clients choose UDP and a range starting from 15000 up to 65k. This range may change from phone to phone, manufacturer to manufacturer. Summary: there is absolutely no standard at all. If you look at the native MediaPlayer implementation in the Android codebase, you will see no specific range either. So it's very likely you'll have trouble. Another bad point is that RTP is often supported on a port range between 9k-15k on TCP (e.g. Blackberry devices). And if you read tips and tricks about configuring a server, you won't catch the Android fact.

Note: This post was a draft for about a year; I reviewed it and posted it. There's nothing outdated in it according to my practical knowledge. If you disagree, contact me for fixes.

Edit: Some phones fall back to TCP based streaming when UDP is not available.

Setting bounds of a map to cover collection of POIs on Android

Lately, as I browse the web for maps related questions on Android, what's frequently requested is an example of setting the bounds of a map (zooming to a proper level and panning) to be able to show all of the given pins on the screen.

Most maps APIs, such as the Google Maps JavaScript API, provide this functionality, yet developers seem to have problems implementing it themselves on Android. The Google Maps API for Android does not provide functionality for setting bounds to a box. Instead, what's provided is zooming to a span.

com.google.android.maps.MapController.<strong>zoomToSpan</strong>(int latSpanE6, int lonSpanE6)

latSpanE6 is the difference in latitudes times 10^6, and similarly lonSpanE6 is the difference in longitudes times 10^6. You may question how the map controller knows where to zoom just from the differences. For example, the distance between two longitudes differs from the equator to the poles. Fortunately, the Google Maps projection keeps them the same length. This may remind you of the infamous South America versus Greenland syndrome: although Greenland is much, much smaller than South America, it doesn't look so with this map projection.

Below, I implemented a boundary arranging method for MapView. The method takes three arguments: items, hpadding and vpadding. items, as you may guess, is a list of POIs. The other arguments are a little more interesting: hpadding and vpadding are the fractions of padding you would like to leave on each axis so that pins don't appear right on the edges. For instance, if you assign 0.1 to hpadding, 10% padding will be left at the top and bottom.

By the way, you'll have to extend the existing MapView and add this method to your own MapView subclass to use it properly.

public void setMapBoundsToPois(List<GeoPoint> items, double hpadding, double vpadding) {
    MapController mapController = this.getController();

    // If there is only one result, directly animate to that location
    if (items.size() == 1) {
        mapController.animateTo(items.get(0));
    } else {
        int minLatitude = Integer.MAX_VALUE;
        int maxLatitude = Integer.MIN_VALUE;
        int minLongitude = Integer.MAX_VALUE;
        int maxLongitude = Integer.MIN_VALUE;

        // Find the boundaries of the item set
        for (GeoPoint item : items) {
            int lat = item.getLatitudeE6();
            int lon = item.getLongitudeE6();

            maxLatitude = Math.max(lat, maxLatitude);
            minLatitude = Math.min(lat, minLatitude);
            maxLongitude = Math.max(lon, maxLongitude);
            minLongitude = Math.min(lon, minLongitude);
        }

        // Leave some padding from the corners, e.g. 0.1 for hpadding and 0.2
        // for vpadding. Compute the spans first so both sides get equal padding.
        int latPadding = (int) ((maxLatitude - minLatitude) * hpadding);
        int lonPadding = (int) ((maxLongitude - minLongitude) * vpadding);

        maxLatitude += latPadding;
        minLatitude -= latPadding;
        maxLongitude += lonPadding;
        minLongitude -= lonPadding;

        // Zoom to the lat, lon spans calculated from the given POIs
        mapController.zoomToSpan(Math.abs(maxLatitude - minLatitude),
                Math.abs(maxLongitude - minLongitude));

        // Animate to the center of the cluster of points
        mapController.animateTo(new GeoPoint(
                (maxLatitude + minLatitude) / 2, (maxLongitude + minLongitude) / 2));
    }
} // end of the method
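For completeness, here is a sketch of how the method above is meant to be wired in: extend MapView, paste setMapBoundsToPois() into the subclass, and call it once your POIs are loaded. The class name below is made up; use whatever fits your project.

import android.content.Context;
import android.util.AttributeSet;
import com.google.android.maps.MapView;

public class BoundedMapView extends MapView {

    public BoundedMapView(Context context, AttributeSet attrs) {
        super(context, attrs);
    }

    // ... setMapBoundsToPois(...) from above goes here ...
}

// Later, for example after your POI download finishes:
// mapView.setMapBoundsToPois(geoPoints, 0.1, 0.1);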

W3C Widgets: The good, the bad and the ugly

It hasn't been long since ppk wrote about a totally new W3C movement called "Widgets". A Widget is a downloadable archive of HTML, JavaScript, CSS and a configuration file; it's a downloadable web front-end. Basically, it's designed for building mobile apps while avoiding the extra network usage consumed by downloading heavyweight pages, CSS and JS. With Widgets, you only consume network traffic for data transmission. Before getting into details, I have to share a fact: to my knowledge, Opera Mobile is the only browser around with Widgets support.

You can read Vodafone's tutorial on making a Widget first to get an initial look.

The Good

For many years, I've been in a huge debate with people who use their work force inefficiently across 35k different platforms and SDKs. Half of all developers have written HTML at least once in their life, and JavaScript has a very large developer base. Every new mobile platform usually re-invents the wheel once again, and this default action is usually driven by business fears.

Widgets make software accessible anywhere you can run a browser. It's definitely "write once, run everywhere". And the complaints about slow page transmission are addressed by running them from local resources.

Widgets will push mobile web browsers to act more like application platforms as the application base grows. Many of the extensions, such as geolocation APIs, don't really fit each other, and some mobile browsers provide totally non-standard features. If web applications dominate mobile, the community will push browsers to behave better.

It's easy to get in. You don't have to download SDKs, learn another language or read documentation/tutorials to learn something new.

The Bad

Performance. Native apps run fast. Even Dalvik-powered Android is horrible and not really responsive compared to other platforms' applications because of Java. Heavy JS in web browsers is not scalable, and just like most other browsers, Safari on the iPhone has rendering issues even on local websites.

Forget the advantages of the Web when it comes to releasing software. No on-the-fly updates at all: the software has to be downloaded again and again as new versions are released. Accessibility to the internal platform is questionable. Open platforms like Android provide access to internals such as contact lists, the file system and invoking other applications. If mobile operating system manufacturers can't agree on providing similar APIs, this won't work.

The Ugly

I find the old generation of the mobile development community very ill-minded. They use their know-how to make money, and this community is invested in its complex and closed environments.

On the other hand, the only contributor is Opera for now. I'm not really sure whether they are going for a larger market share or not. If an open standard ends up acting like just another platform for Opera browser phones, it's the same old story.

Maps Development on Android: Registering a Maps API key

Location based applications are a must on mobile platforms. Android does not have maps natively, but the Google Maps team provides an add-on that comes with the Android SDK (at least since 1.5). In this post, I'm not going to show you how to pop up maps on your little mobile screen, but to underline the application signature details related to the Maps API.

First of all, in order to show map tiles properly, we need an API key. This is because you are requesting data from the Google Data API and have to agree to the terms of service. I'm also sure that quota rules apply.

Every Android application is signed with a signature of the publisher. While obtaining a key, you must provide the MD5 fingerprint of your signature to Google, and Google enables transactions between the Maps API and the applications your signature signs. In order to complete these actions, you have to

  1. Obtain the MD5 fingerprint of your signature. If you do not have a signature, you can use the default one.
  2. Sign up for an API key directly from Google by providing the hash of your signature.
  3. Use the API key with map elements and generate a sample map view.

Obtaining an API key

You will have to use keytool to obtain information about signatures. If you haven't created one, the Android SDK puts a default one in your ~/.android directory. In this tutorial, I'm going to show you how to register with this default signature. Open a terminal prompt and enter

$ keytool -list -keystore <strong>~/.android/debug.keystore</strong>

It's going to ask you for the password of the keystore (debug.keystore). The default is "android". If you receive a MalformedKeyringException, you are giving the wrong password. If everything works, it will output a few lines of information including the hash. Find the fingerprint line and copy the hash.

Certificate fingerprint (MD5): <strong>E8:F4:F2:BF:03:F3:3A:3D:F3:52:19:9B:58:20:87:68</strong>

After obtaining the fingerprint, you can jump to the next level: signing up for an API key. Give the hash as input and register. Note down the API key Google gives you.

Generating Maps on Android

The Android SDK comes with two archives. The first one is android.jar, which contains the standard platform libraries, and the second is maps.jar, a library dedicated to the generation of maps. In the Maps API, you will notice MapView. You can extend MapView to customize it and add new features to show a custom map view, or invoke the existing methods to perform simple operations like panning, zooming and adding overlays to show information on the default map. There are great tutorials about Android's map view and controller on the web, and I simply didn't want to copy the existing ones. Google's Hello, MapView is a place to start.
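If you just want to see where the API key ends up, here is a minimal sketch of a map activity that creates the MapView in code; the class name and key string are placeholders for your own.

import android.os.Bundle;
import com.google.android.maps.MapActivity;
import com.google.android.maps.MapView;

public class SimpleMapActivity extends MapActivity {

    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        // The second argument is the Maps API key registered for your signature.
        MapView mapView = new MapView(this, "YOUR_MAPS_API_KEY");
        mapView.setClickable(true);
        mapView.setBuiltInZoomControls(true);
        setContentView(mapView);
    }

    @Override
    protected boolean isRouteDisplayed() {
        return false; // required by MapActivity
    }
}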

Multiple-Developer Cases

A signature can only be associated with a single API key. What are you going to do if development is spread across a team? You don't need to create different signatures for each developer and register them to use the Data API one by one. Register a single signature and obtain a key. Then, distribute the signature among the developers; better yet, add it to your version control system.

Why do Code Reviews Matter?

We build software together. Team sizes vary a lot, but it's usually not 2 or 3. Team members leave, new members join, and at the end of the day codebases are shared among large numbers of different and diverse people across development, testing and deployment. The code review process is what you put into action every time you modify the code itself. Even if you change a single line, before committing the code to the repository, a peer reviews it and confirms it can be submitted. Reviews can help decrease the number of bugs, vulnerable pieces of code, violations of coding standards and so on. In an environment with no reviewing practices, the actual code won't be visible to other developers until another member opens the file(s) for editing. Without reviews, even if the build passes QA tests,

  1. Readability of the code,
  2. Compatibility with coding standards,
  3. Organization of the codebase,
  4. Documentation inside the code /* comments */,

will remain surprises. These may lead to a dirty and badly organized codebase after a few years of continuous development. Asking a peer before you integrate new code into your product is the better approach, since you will most probably have more time to fix, reshape and enhance your code if you take action earlier. Usually only one other team member will have time to review your code, but reviewing should ideally be done by two people. The first should be a master, the one who knows the existing system well and can see the big picture; he can guess the impact of your changes across the codebase. The other peer should ideally be someone who is not very familiar with the code, so you can test how easy it is to get into your code if you have to assign a bug to a new member, how well it fits established software development patterns, and how simple and obvious your solution is.

Consequently, whatever you are working on is shared among people. Asking for a review is always better than not asking and keeping your actual contribution a secret until it needs to be changed.

Let's modify our representation of addresses in adr microformat

Microformats define a representation spec for addresses, called adr. This year, I made two distinct proposals to modify the current draft, but was turned down each time. In this post, I'm going to address the current problems and how tiny enhancements could bring new horizons to the retrieval of location based information.

The problem

The current spec does not serve as a latitude, longitude carrier. The current properties only include post-office-box, extended-address, street-address, locality, region, postal-code and country-name, which are text fields that form an address. This schema was defined in vCard and migrated to the hCard microformat in 2004. Then, the need for address-based extraction led them to copy this format and call it adr. vCard's final design spec came way before we had online maps. Nowadays, we have addresses all over the Web. Automatically directing these text addresses to locations on maps, or providing a preview on hover, would be the first basic attempts to improve our data representations. But unfortunately, maps talk more in mathematics than in text addresses. In practice, there is a process that takes addresses, transforms them into a latitude, longitude pair and pans the map to that location. This process is called geocoding, and it is far from perfect at today's scale. Instead of depending on a geocoder to transform addresses into mathematical locations, I suggest microformats enable built-in (lat, long) arrays in adr.

Extending adr with a set of latitudes and longitudes

What I'm going for is to extend adr with an optional list of (lat, long) values. So, in cases where coordinates are given, instead of asking a geocoder to land us on a location, we can move there directly. But why use a list of coordinates and not a single point? Because, in the spatial domain, different geometric structures are represented as different shapes. Examples are below.

  1. If you are talking about a city centre, it's most likely to be a Point.
  2. The Mississippi river is a long, long Line.
  3. And a university campus is obviously a Polygon.

In the image above, the British Museum is represented by 12 latitude and longitude pairs, as the inner area these points compose. On the other hand, another representation may be made with the centre point of the museum, (51.529038, -0.128403). Formally speaking, the museum is located at "British Museum, Great Russell Street, London, WC1B 3DG, UK", and this translates to the coordinate I gave. What about using them together to form:

<div class="adr">

<div class="street-address">Great Russell Street</div>
<span class="locality">London</span>,
<span class="postal-code">WC1B 3DG</span>,
<div class="country-name">UK</div>

<div class="geo"> <!-- optional coordinates attribute from geo -->
<span class="latitude">51.529038</span>,
<span class="longitude">-0.128403</span>
</div>

</div>

In the example above, I've used geo to optionally include the single point that maps the address to a physical location. More useful structures could be defined within the standard to enable multiple point entries that form polygons, such as the 12-point representation of the British Museum in the image above. Or, simply, multiple geo entries inside an adr may work.

Testing for Accuracy and Precision

Software testing has no boundaries at all. This discipline is so unique that it's not very common to see systematic approaches, due to the variety of material and the changing tradeoffs. A few weeks ago, I came across a decent software testing article from a Microsoft engineer, published on Live Spaces. Unfortunately, it was followed by 2 spam comments; it was very ironic to see such an assertive article ruined by two regular Russian spammers.

I love machine learning and classification. My whole life is spent between two parameters: accuracy and precision. These are the common statistical values that determine how successful your system is. If you have a search engine, accuracy may tell you what percentage of retrieved documents are really relevant. And precision is a value that determines how well your results cover all the relevant documents available.
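To make the two ratios concrete, here is a toy sketch with made-up document ids. The first number is the share of retrieved documents that are relevant (what I call accuracy above); the second is how much of everything relevant was actually retrieved, named coverage here only to keep the code readable.

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class RetrievalRatios {
    public static void main(String[] args) {
        Set<String> retrieved = new HashSet<String>(Arrays.asList("d1", "d2", "d3", "d4"));
        Set<String> relevant = new HashSet<String>(Arrays.asList("d2", "d3", "d5"));

        // Documents that are both retrieved and relevant
        Set<String> hits = new HashSet<String>(retrieved);
        hits.retainAll(relevant);

        double accuracy = (double) hits.size() / retrieved.size(); // 2 / 4 = 0.5
        double coverage = (double) hits.size() / relevant.size();  // 2 / 3 ~= 0.67

        System.out.println("accuracy = " + accuracy + ", coverage = " + coverage);
    }
}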

Surprisingly, a few days ago I was asked to break a machine learning system during a job interview; I was asked to come up with some possible cases. According to my own philosophy, accuracy and precision are part of the system requirements. They relate to the quality of the overall product. But how are you going to collect the information to come up with these numbers? Imagine you are working on a search engine. Is it manageable to find n people and ask them manually whether they like the results or not? Will your sample (n people) reflect your user base? How costly will it be, and how objective? Is it really scalable? Is it possible for a human to read all of the documents on the Web and decide which are really related to his search phrase? These are a few introductory level problems with the analysis of accuracy and precision.

Post-Processing and the Importance of Feedback

It may not be critical for you to release a product with a target accuracy and precision; the consumer market mostly suits this model best. But this alone should not be translated into the "inessentiality of quality tracking". I am just advising you to track quality after the release (similar to the ship-then-test method). Detect which results are exit links, provide instant feedback tools for users to refine their results, and so on. Use the acquired feedback to improve the existing system. Testing may not end with the release; you may need to discuss and analyze whether your product is performing well, report to your development team and influence them with scalable user-oriented improvements.

Addresses Not Found in High Traffic

My sister found herself a new downloading hobby, and I was not planning to be the hobby killer until everything became inaccessible for both of us. She's been downloading heavily recently; I'm not sure about the material, but it's a high load. Pages were loading slower on my side, as expected, and I'm not saying I have a wide bandwidth, but the overall bottleneck was not just slower uploads or downloads.

UDP 53, what’s wrong there?

I started to recognize a pattern. My downloads were even slower because name resolution was failing miserably every time I tried; I was not even able to resolve domain names to IP addresses. I had to check for myself what might be causing this problem. As a quick note: if your local DNS cache (managed by the operating system) doesn't have a record for the domain name you're trying to visit, you make a request to one of the nearby DNS servers to return the associated IP. If your nearby server doesn't have that record, it asks the root servers, and so on. Most of my readers know the story well. This communication is made on UDP port 53. UDP is a connectionless way to transmit data. Unlike TCP, you don't have to spend time on three-way handshakes to make a proper connection that both sides are aware of. But if your packets get lost, nobody is responsible. It's like playing a game; many tradeoffs, similar to every engineering issue.

I gently asked my sister to stop for a while, and started to receive UDP answers that no longer timed out. The resolving problem was fixed. But I had to be convinced that UDP was the best possible choice. I understood that the essential parameter was latency: we have to be fast, faster, as fast as possible. I wanted to step back to understand why it was designed this way, and my problem appeared with its solution in milliseconds.

Why is DNS using UDP?

Reliability versus speed. Remember the rule: if you don't have the address, ask a nearby name server. Is it implicitly saying "Don't go too far"? Probably it is. You're not on a very reliable connection, and if your traffic load is very high, there will be a lot of congestion, long delays and large jitter. My DNS requests most probably couldn't even make it to the name server. And since my ISP's name servers are not reliable, I was using OpenDNS. Translation: I was far, far, far away from the source.

I fixed the issue. Even though the crazy downloading is on again, my domains are resolving rapidly at the moment. I'm extremely happy. If you're using OpenDNS at an office or on a LAN with 20+ clients, do yourself a favor and set up a local name server today.

Data Manipulation on Client Machines

It's my third week and I'm discussing the scalability of the real-time web. We're only talking about text input, realtime search, trend extraction and so on. I have a growing love for this instant replier; it makes me feel more connected to real people (a sort of egoism). It's good because we have text: none of the realtime providers are working on more than indexing the "text" for searching, as it turns out. Text is a medium of communication, and there is a lot more: images, audio, video, etc. You are not really interested in multimedia, I guess, because text is cool. You can skim it, you can select it, you can easily process it. But the world is not man-made, and I can't even imagine a moment where maps are served as text, for example. Typing is the human's built-in analog-to-digital converter. Love it or not, we are forced ungracefully by nature to talk in multimedia when text alone is not efficient enough.

Realtime Data Processing

Realtime environments have to play with data to make connections and provide smart searching that does not only depend on full-text comparison. Imagine that they have to tokenize the text input to post-process what's going on in the message. That alone requires roughly O(n) on the CPU. Twitter has about 1 million users; let's assume on average every user enters a new tweet once every 3 days and the average entry is 100 characters long (doesn't sound realistic, but let's be optimistic). That converges to roughly 350k posts a day. 350k tokenizations of 100-character input means 35 million characters are processed during a day. I tried to tokenize a 100-char string 350k times and it took only 41 seconds, since I was using the same string over and over again. With the help of caching, the CPU minimizes memory I/O and that made a huge difference, so my 41 seconds were far from accurate. Besides tokenizing, there are other operations you have to run. But once you fetch it from memory, you're mostly done. Therefore, I believe it's not really an extra load to tokenize the input on the server side.
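Here is a rough re-creation of that experiment, in case you want to try it yourself: tokenize a roughly 100-character string 350,000 times and time it. The sample text is made up, and reusing the same string keeps it hot in the cache, so the number is optimistic, exactly as noted above.

public class TokenizeBenchmark {
    public static void main(String[] args) {
        String post = "just landed in london, heading to the science museum with "
                + "a couple of friends before the evening talk";
        long start = System.currentTimeMillis();
        long tokens = 0;
        for (int i = 0; i < 350000; i++) {
            tokens += post.split("\\s+").length; // naive whitespace tokenizer
        }
        System.out.println(tokens + " tokens in "
                + (System.currentTimeMillis() - start) + " ms");
    }
}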

But what would you do if you had to post-process terabytes of imagery? I'm not sure if you are aware of Microsoft's Virtual Earth 3D, but it is more or less Google Earth running in your browser.

A very long time ago, I was giving a demo to my friends, showing Mt. Rainier in WA in 3D mode. Virtual Earth 3D fetches higher quality imagery for the foreground. Since there is no colour balance adjustment in VE imagery, many people thought the level differences in the scene were some sort of corruption, and not good looking. I told the engineers that we could solve this problem with relative histogram equalization (fancy name, but an easy method). It didn't sound perfectly realistic, because our imagery was tens of terabytes and it was very risky to process it all for such a tiny improvement.

Luckily, that's when I recognized client power for the first time, and what it brings. I was going to add an improvement step to equalize imagery once it was loaded as-is onto the user's machine. The user's machine itself was about to play a role in balancing images on the front-end.
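For the curious, here is a minimal sketch of plain histogram equalization on an 8-bit grayscale tile. The "relative" variant mentioned above, which balances a tile against its neighbours, is more involved; this is only the textbook building block.

public class HistogramEqualizer {

    // pixels holds intensities in the 0-255 range; returns the equalized image.
    public static int[] equalize(int[] pixels) {
        int[] histogram = new int[256];
        for (int p : pixels) {
            histogram[p]++;
        }

        // Cumulative distribution function of the intensities
        int[] cdf = new int[256];
        int running = 0;
        for (int i = 0; i < 256; i++) {
            running += histogram[i];
            cdf[i] = running;
        }

        // Map each intensity through the normalized CDF to spread the histogram
        int total = pixels.length;
        int[] out = new int[total];
        for (int i = 0; i < total; i++) {
            out[i] = (int) Math.round(255.0 * cdf[pixels[i]] / total);
        }
        return out;
    }
}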

Future of the Web: More Realtime Data Needs More Client Power

The Twitter example is quite lightweight; 35 million characters/day is not huge, as you may guess. As millions start to use realtime equipment passionately and generate a hundred thousand times more content than this, we may have to take advantage of our users' machines, mostly for pre- and post-processing of data. All of the new client technologies are about providing a rich environment and framework to let users contribute to the overall system scalably. The future is going to be all about making the distribution decision between clients and servers. If you are a server guy who has no idea about user machines, study your lesson before this balance dominates the world.

Edit: Sorry for quoting VE's 3D mode as Silverlight; thanks to Jon O'Brein (a Microsoft MVP) for fixing my deadly fault. I was high or something.

Using Paper and Pencil Before Coding

I can't remember ever starting to code, even a single tiny function, without illustrating it on paper. Do I write code on paper? Yes, I do that too. I adore coding on whiteboards, where many people can come together on a coffee break and discuss. Most young developers ignore that single rule in software development: to write decent code, you have to spend more time thinking, criticising and brainstorming, determining what the problem is before getting into the solution. Defining the constraints, becoming aware of the exceptions, and so on.

Early Starting or Deep Thinking?

I wish the first idea that comes to mind were the perfect one. But usually it is not even close to being a good solution; it is often greedy and very straightforward. The more you learn about the problem and its characteristics, the better and more efficient the solution that will pop up. If you start coding in the first hour, you will probably never finish the task. I'd like to share a case study from Jon Bentley's amazing book, Programming Pearls.

The story is about Andrew Appel and his experience reducing the run time of a system from one year to one day. They had to simulate the motions of N objects in 3D space, given their masses and initial positions and velocities. They basically were interested in predicting the positions of planets, stars and galaxies 10 years, 100 years, 1000 years later. In a 2D space the input may look like the image on the left: a large set of vectors.

Regular simulation programs divide time into small steps and compute the progress of each object at each step. Because of the gravitational field and mass attractions, the computational cost was proportional to N^2. Appel estimated that 1000 time steps of such a system with N=10000 objects would require more than one year on a VAX-11/780, or a day on a Cray-1. Andrew had to develop a system far more efficient than a one-year-running turtle. After several speedup improvements, he reduced the runtime to less than a day on the VAX-11/780, almost 400 times faster than the initial setup.
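To get a feel for the numbers: with N = 10000 objects, an all-pairs step needs on the order of N^2 = 10^8 force evaluations, and 1000 time steps push that towards 10^11. A tree that lets each object interact with roughly log2(N), about 13, cluster nodes cuts the per-step work to around 1.3 x 10^5 interactions. The measured factor of 12 below is much smaller than these raw counts suggest, because each tree interaction is more expensive than a direct one, but the asymptotic change is what made every later constant-factor tuning worthwhile.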

He managed to change the concept on many levels to make such a huge difference.

  1. Data Structures: Foremost, O(N^2) had to go. He decided to represent objects as leaves of a binary tree; higher nodes represent clusters of objects. The force acting on a particular object can be approximated by the force exerted by the large clusters. This alone reduced the complexity to O(N log N) and cut the run time by a factor of 12.

  2. Algorithm Tuning: The initial algorithm had to use tiny time steps to handle the rare case of two particles coming close to one another. The tree-based structure made it possible to investigate such rare cases independently of the time steps. As a result, Andrew doubled the time step size and halved the run time.

  3. Data Structure Reorganization: The tree that represented the initial objects was quite bad at representing the later sets. Reconfiguring the tree at every new step was a bit costly, but it reduced the number of local calculations and halved the total runtime.

  4. Code Tuning: Due to the additional numerical accuracy provided by the tree, 64-bit double-precision floating point numbers could be replaced by 32-bit single precision, which halved the runtime. Additionally, profiling the application showed that 98% of the runtime was being spent in one procedure. Rewriting that procedure in assembly increased the speed by a factor of 2.5.

  5. Hardware Extensions: Moving the program to a similar machine with a floating point accelerator halved the total runtime.

Finally, the new system required only one day to simulate 10k objects. Most of the work done by Andrew Appel was visible without writing a single line of code. If your inventions don't rely on statistical results to be proved, you should be able to make all of these decisions without writing a single line of code. Being able to predict the system's response on paper can make you the one who makes the difference.

Even in professional environments, people tend to start coding as soon as possible. Skipping the initial design doesn't only lead to functional mistakes and a program with tons of bugs; it may also blind you to seeing the problem from different angles and to representing data in different ways to speed up the system. Only good estimators can write good software. Therefore, I'm not interested in which new fancy technologies you know; I'm far more interested in whether you have a wide range of knowledge from the top down to the hardware and are able to see the bottlenecks and improve them on paper. Then, any developer can write the code.

JPEG to Compress Vector Graphics?

Compression, entropy and information theory have things in common with economics: nobody's going to turn their head unless you stop talking maths. I truly understand they are a bit boring and complex for your grandmother, but surprisingly most technical people don't bother to understand the underlying concepts either. In this post, I'm going to go over the following topics in plain language to sketch the overall scheme in your mind:

  1. What is image compression and why do we use it?
  2. A brief overview of JPEG compression.
  3. A review of Google Maps and Live Search Maps for serving images as the primary content.

Introduction to Image Compression

Knowing the term "compression" doesn't mean we are done with the definition. Hold on.

Image compression, the art and science of reducing the amount of data required to represent an image, is one of the most useful and commercially successful technologies in the field of digital image processing. (Digital Image Processing 3rd Ed., Gonzalez & Woods, page 525)

Let's first try to understand how compression became one of the most commercially successful fields in image processing. With the irrepressible popularity of television and the Internet (after the mid-90s), images and videos became significant means of representing information. With no compression, a coloured standard TV broadcast, 640x480 pixels with a refresh rate of 30 frames per second, requires 27,648,000 bytes to be transmitted per second (640 x 480 pixels x 3 bytes per pixel x 30 frames). Even with tomorrow's technology, supplying a connection of almost 30 MBytes/second just for a TV broadcast doesn't seem feasible. It's no surprise to hear the many failed quotes from the early 1900s claiming television would never find the opportunity to be on the market.

Data versus Information

When we talk about compression, we mean the compression of data. Data is transferred to carry information; therefore, we might be able to reduce the amount of data that represents a given quantity of information. A parrot in a very crowded downtown barber shop loves to say "Hello" to every new customer that comes in. How would you transmit the words it spells out as text most efficiently?

Statistically, hello is the most common word. Representing it with a single bit is highly acceptable, instead of transferring 5 characters (5*8 = 40 bits). In the example above, it's very clear that statistics say "stranger" is the second word we are most likely to hear from the parrot, and so on. Converting (mapping) the string array into a bit stream saves 94% of the bandwidth in this case. Huffman coding, which will remind you of the method above, guarantees the minimum possible number of bits if you have statistical information about the data.
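To make that less hand-wavy, here is a compact Huffman sketch for the parrot's vocabulary. The word frequencies are made up for illustration; the point is simply that frequent words end up with shorter codes.

import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

public class HuffmanSketch {

    static class Node implements Comparable<Node> {
        final String word; // null for internal nodes
        final int freq;
        final Node left, right;

        Node(String word, int freq) { this(word, freq, null, null); }
        Node(String word, int freq, Node left, Node right) {
            this.word = word; this.freq = freq; this.left = left; this.right = right;
        }
        public int compareTo(Node other) { return this.freq - other.freq; }
    }

    static Map<String, String> buildCodes(Map<String, Integer> freqs) {
        PriorityQueue<Node> queue = new PriorityQueue<Node>();
        for (Map.Entry<String, Integer> e : freqs.entrySet()) {
            queue.add(new Node(e.getKey(), e.getValue()));
        }
        // Repeatedly merge the two least frequent nodes into one.
        while (queue.size() > 1) {
            Node a = queue.poll();
            Node b = queue.poll();
            queue.add(new Node(null, a.freq + b.freq, a, b));
        }
        Map<String, String> codes = new HashMap<String, String>();
        assign(queue.poll(), "", codes);
        return codes;
    }

    static void assign(Node node, String prefix, Map<String, String> codes) {
        if (node.word != null) {
            codes.put(node.word, prefix.length() == 0 ? "0" : prefix);
        } else {
            assign(node.left, prefix + "0", codes);
            assign(node.right, prefix + "1", codes);
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> freqs = new HashMap<String, Integer>();
        freqs.put("hello", 80); // the parrot's favourite
        freqs.put("stranger", 12);
        freqs.put("coffee", 5);
        freqs.put("scissors", 3);
        System.out.println(buildCodes(freqs)); // "hello" gets the shortest code
    }
}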

Image Compression Techniques

Generally, image compression techniques are separated into two categories: lossy and lossless compression. Lossy methods take advantage of the limits of human vision, eliminating details and losing information to reduce the amount of data. Lossy methods are mostly used for natural images. In lossless compression, the encoding process finds a smart way to represent the same amount of information with less data, just like in the example above with Mr. Parrot. And some hybrid models mix the advantages of both sides.

JPEG Compression

JPEG is a hybrid compression method that can optionally perform totally lossless compression. It mainly depends on the fact that the correlation between close pixels doesn't change much in natural photos, so we are able to guess the surrounding pixels if we know the one in the middle. JPEG's encoding algorithm is separated into four steps:

  1. Blocking the image into little pieces, usually 8x8 pixel blocks (some use 16x16 blocks). Separating the image into little pieces will help us guess the nearby pixels.
  2. Discrete Cosine Transform (DCT): In "simple" terms, this process generates a stream for each block which starts with the average value of the block and continues with the details of how pixels change horizontally and vertically. So, in addition to the guess, we have extra information to correct it. Therefore, DCT is totally lossless; it only represents the same information in other words. It outputs a stream of 64 numerical elements like: 125 34 -32 43 2 21 -43 1 -3 4... (a smart way to represent information)
  3. Quantization: Adjusts precision. This is where we lose information. The output stream may turn into: 125 35 -30 45 0 20 -45 -5 5... (human vision is not that sensitive)
  4. Entropy encoding: You may encode the stream coming from quantization with Huffman coding to save bandwidth, just like we did in the parrot example. (a smart way to represent information)
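As a small illustration of step 3, the lossy one, here is a sketch of quantizing a single 8x8 block of DCT coefficients. The quantization table is passed in by the caller and is not the standard JPEG luminance table; bigger entries (usually used for high frequencies) throw away more precision.

public class QuantizeSketch {

    // Divide each DCT coefficient by its table entry and round; this is where
    // the information loss happens.
    public static int[][] quantize(double[][] dctBlock, int[][] table) {
        int[][] out = new int[8][8];
        for (int i = 0; i < 8; i++) {
            for (int j = 0; j < 8; j++) {
                out[i][j] = (int) Math.round(dctBlock[i][j] / table[i][j]);
            }
        }
        return out;
    }
}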

Maps and Compression

Maps serve images as their primary content. Most providers have road, aerial, labelled road, labelled aerial, terrain (Google does), bird's eye (Live Search Maps does) views and so on. In this section, I'm going to discuss mostly the aerial imagery and how many bytes the providers spend to share the same amount of information. Both providers have road, aerial and labelled aerial image sources.

Google Maps (maps.google.com)

Google Maps chooses to separate vector and natural images: it encodes vectors as transparent PNGs and aerial images as JPEGs, and adds the vector information as an overlay if the user prefers to view the area with labels.

Image sizes, in the corresponding order, are: 27.6KB (JPEG), 13.7KB (PNG), 0KB (overlaid versions of the 1st and 2nd image), 25.6KB (PNG). In this case, let's count the bytes Google consumes for the following actions:

  1. Road imagery views: 25.6KB
  2. A switch to labelled aerial: 27.6KB + 13.7KB = 41.3KB
  3. A switch to aerial without labels: cost free
  4. Total cost = 25.6KB + 41.3KB = 66.9KB

Google only seems to make its encoding decision based on the image's characteristics: JPEG for aerial, PNG for the rest. But unfortunately, this leads to the transmission of repeated road information on a switch to aerial. Take a look at images 2 and 4 above.

I'm guessing most people don't tend to switch to aerial until they zoom in to street level. So, instead of pushing them to download two separate images for the road mode, Google Maps may have chosen this method. HTTP requests are costly, and a dozen of them are more costly than costly. A map control usually has to download 8-12 tiles depending on your screen resolution. Just think of the overhead those 12 tiles generate.

Live Search Maps (maps.live.com)

We are in the same area, somewhere in downtown London again, with Live Search Maps.

Image sizes, in the corresponding order, are: 28.52KB (JPEG), 29.82KB (JPEG), 25.6KB (PNG). Repeating the same user actions above for a tile:

  1. Road imagery views: 25.6KB
  2. A switch to labelled aerial: 29.82KB
  3. A switch to aerial without labels: 28.52KB
  4. Total cost = 25.6KB + 29.82KB + 28.52KB = 83.94KB

If every user were switching between these map modes back and forth at every zoom level, Google Maps would beat LSM by a small margin*. Labels on aerial images don't look as sharp as Google's, but the size stays at an appropriate level although the tile is encoded as a JPEG. If users switch to non-labelled aerial very rarely, for example after pinpointing an area they are searching, this method might be practical and more efficient than Google's overlaying concept.

Above are 200% zoomed versions of labelled aerial tiles from San Francisco. The left image comes from Virtual Earth and the other is Google's. In my own personal opinion, I find Google's labels more readable than Virtual Earth's, despite my being 22 years old and topping the heap when it comes to visual sensitivity.

(*) Note: I repeated the process for 100+ random tiles. Ratios don't seem to change much.

Custom tiles?

What about custom tile integration with the Virtual Earth and Google Maps APIs? Google's method allows you to overlay roads on your custom maps (I didn't check it, but it's not a big deal; they already have isolated labels).

Would you Like to Throw Exceptions from a Constructor?

Throwing an exception from a constructor has always been a debatable issue among my friends after a few beers of joy. It first came to me on an exam where the question asked me to design a class that lets no more than n instances be constructed.

Instead of using the most straightforward trick, a private constructor, I decided to throw an exception from the constructor if more than n instances of that particular class already existed. Luckily, I was coding in Java and had the comfort of a managed environment under my belt. Thankfully, my answer wasn't graded, but this case became a major concern and I decided to run several experiments back in the old days.
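In Java, such a guard might look like the sketch below; the class name, the limit and the exception type are arbitrary, and the counter isn't thread-safe.

public class Limited {
    private static final int MAX_INSTANCES = 3;
    private static int instanceCount = 0;

    public Limited() {
        if (instanceCount >= MAX_INSTANCES) {
            // Aborts construction: the caller never gets a reference.
            throw new IllegalStateException(
                    "No more than " + MAX_INSTANCES + " instances are allowed.");
        }
        instanceCount++;
    }
}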

From what I have seen while debugging, Java simply assigns nulls or zeros to all of the fields of the class until the constructor has executed successfully. Additionally, when an exception is thrown from a constructor, no usable reference is returned, and the newly created invalid instance just waits for the garbage collector to pick it up and remove the useless object from memory permanently. In short, a new instance is created, although the platform tends to murder it quickly and silently.

What about Unmanaged Platforms?

It was nice to see somebody taking care of such exceptional situations, but what would you do if you were on your own? Fault tolerance isn't just catching all exceptions (from the slides of Bjarne Stroustrup):

  1. The program must be in a valid state after an exception is caught
  2. Resource leaks must be avoided

This becomes a problem where memory management is all up to you. You will have to remove and clean up resources after an exception has been thrown; otherwise it will cause a memory leak and you will have created a zombie object hanging around in memory, attached to the mother thread.

Cleaning Up the Mess of C++

An object is not even an object until its constructor has finished successfully. So, better try to release resources inside the constructor before you throw an exception. In such cases, I always use a double-exception throwing pattern that I made up and passionately use in similar situations.

Let's ask the compiler! Two lines of code are always more expressive than 10 lines of words. Below, I've written a class called "Foo", not functional but dangerous, and an exception, StreamInitException, is thrown if a problem happens while longCharStream is being initialized. Right after we allocate 10k bytes for longCharStream, BOOM! Someone pushed the wrong button, somebody shut down the database, or the trouble makers' spirits are near. Don't catch it inside the constructor and see what happens: many orphan kids, which we once named longCharStream, are all around and we can't even access them.

#include <iostream>

struct StreamInitException {
    StreamInitException(const char * m) { message = m; }
    const char * message;
};

class Foo {
    public:
        Foo() : longCharStream(NULL) {
            try {
                this->initCharStream();
            } catch(StreamInitException * ex) {
                this->~Foo(); // release resources acquired so far
                throw ex;     // re-throw the exception originally thrown by initCharStream
            }
        }

        ~Foo() {
            // delete[] on NULL is safe, but keep the check for clarity
            if (this->longCharStream)
                delete[] this->longCharStream;
        }

        void initCharStream() {
            // create a long char array of 10k
            this->longCharStream = new char[10000];
            throw new StreamInitException("Error occurred during init.");
        }

    private:
        char * longCharStream;
};

int main() {
    Foo * foo = NULL;

    try {
        foo = new Foo;
    } catch(StreamInitException * ex) {
        std::cout << ex->message << std::endl;
        delete ex; // the exception object itself was heap-allocated
    }

    delete foo; // still NULL if construction failed
    return 0;
}

Therefore, I handle the init-related failures of the object inside the constructor, deallocate memory and so on, and finally re-throw the same exception once I have hold of the control. Seems acceptable!

Consequently, I have a final question: do you prefer to throw exceptions from constructors, or to search for other workarounds, such as assigning the workload to a separate method called Init()?

Domain Name Server Fight is Back!

This is the second time I'm facing a horrible situation in 12 years of my active Internet adventure. One or a few DNS servers ignore my change requests and keep pointing to the old IP although two weeks have passed since my edits. Those rebel servers cause half of the resolvers to use the old IP address and keep my old end-point alive recursively. I'm not sure if it's a TTL issue with an A RECORD or a hacking attempt via DNS injection. I don't know the answer yet, because I'm not able to access every name server (naturally), although I can traverse my path and know the evil ones.

Does anybody have a clue about what is happening behind the scenes? Straightforward actions like switching to my own name servers may not help with the issue at this moment; the new changes might end up deadlocked in the same cycle. Is there anybody out there who knows what's happening?

An Introduction to Fundamental Web Crawling Strategies

Generally, any search engine architecture consists of four core elements: a crawler, an indexer, a retrieval engine and a user interface that interacts with end users. In this post, I'll give an introduction to crawlers, crawling strategies and the main challenges search engines face with the growth of the Web.

What is a Web Crawler?

A web crawler is an automatic web page collector that creates a cache/base of local copies of the pages it finds. Initially, a crawler starts with a beginning set of known URLs, also known as seed URLs. Then the crawler extracts the links inside the known documents and is responsible for downloading these newly found pages in some order. In the crawling field, there are two major crawler types:

  1. Periodic or Snapshot Crawling: The crawler continues to find new pages until the collection hits a desirable size. Then it periodically runs the same process and replaces the existing collection with the new one. There are typically very large intervals between two crawls. This method doesn't use any knowledge from the previous crawls.

  2. Incremental Crawling: These crawlers keep searching for new links even after the collection becomes as large as desired. Old pages are repeatedly revisited on a schedule to keep the collection up to date. It's very hard to crawl a large source (such as the Web) this way: documents need to be refreshed one by one, and new links should be explored and folded into the local collection. According to the statistics, the Web's growth function is exponential, which means there are even more newly added pages than updates to existing documents. We unfortunately have limited processing power, so it becomes ever more critical to decide which page to (re)visit next from the queue.

The figure above shows that it is more useful to choose snapshot strategies for large scope search engines. On the other side, if rapid change of documents is guaranteed for a small scale corpus, it's more effective to use an incremental crawler.

Scheduling the Download Queue

Although I stated that it is more likely to use snapshot crawling for a large number of documents, the large gaps between updates make it an absolutely impractical solution for Web search engines, while competitors such as Google explore even the least significant change within hours. Incremental crawling looks (and actually is) very costly if we cannot predict when a page is updated or removed. Re-downloading the whole Web in short periods is also impossible. But what if we could guess how often a page changes? What if we could prioritize each URL and schedule our download queue?

The crawl scheduling methods used in commercial engines are kept secret, like the formula of Coca-Cola. There are only a few fundamental techniques I'll share here to give beginners an impression.

Breadth-first search: The easiest but least efficient crawling algorithm. There isn't a particular sorting process for the download queue; documents are downloaded in the same order their URLs are found. But all of the links at one level should be crawled before the crawler starts to discover the next level. In the figure on the right, links in the seed level document will be downloaded before the links in the level0 documents, and so on.
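A breadth-first frontier is literally just a FIFO queue plus a set of already seen URLs. The sketch below shows the skeleton; fetchAndExtractLinks() is a placeholder for the real download and link extraction step.

import java.util.ArrayDeque;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

public class BreadthFirstCrawler {

    private final Queue<String> frontier = new ArrayDeque<String>();
    private final Set<String> seen = new HashSet<String>();

    public void crawl(List<String> seedUrls, int maxPages) {
        frontier.addAll(seedUrls);
        seen.addAll(seedUrls);
        int downloaded = 0;
        while (!frontier.isEmpty() && downloaded < maxPages) {
            String url = frontier.poll(); // FIFO: pages come out in discovery order
            List<String> links = fetchAndExtractLinks(url);
            downloaded++;
            for (String link : links) {
                if (seen.add(link)) { // only enqueue URLs we haven't seen before
                    frontier.add(link);
                }
            }
        }
    }

    private List<String> fetchAndExtractLinks(String url) {
        // Placeholder: politely download the page and return the URLs found in it.
        return Collections.emptyList();
    }
}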

Best-first search: This is currently the most popular algorithm used in focused crawlers. In best-first search, URLs are not visited in the order they are discovered; instead, some heuristics (usually results from Web analysis algorithms) are used to rank the URLs in the crawling queue, and those that are considered more promising to point to relevant pages are visited first. The least important pages have very little chance of being visited and are continuously pushed to the back of the queue. Back-link counts and partial PageRank have found a large application domain in this field.

Tunnelling: This method is a good solution for finding the most relevant URLs that occur in a page. Let's assume we are exploring links in document D, and one of the links in D refers to document C. Some crawlers rank C to check how relevant C is to D. For example, if URLs pointing to C and D occur together on a page other than these pages, they are most likely to be relevant. Or, if both D and C link to an existing page other than D or C, they may be relevant. Documents are scored according to their relevance to the root document and queued according to their rankings.
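The best-first variant mentioned above only changes the data structure: the frontier becomes a priority queue ordered by whatever heuristic score you trust, for example a back-link count or partial PageRank. In the sketch below, scoreOf() is a placeholder for that Web analysis step.

import java.util.Comparator;
import java.util.PriorityQueue;

public class BestFirstFrontier {

    static class ScoredUrl {
        final String url;
        final double score;
        ScoredUrl(String url, double score) { this.url = url; this.score = score; }
    }

    private final PriorityQueue<ScoredUrl> frontier =
            new PriorityQueue<ScoredUrl>(11, new Comparator<ScoredUrl>() {
                public int compare(ScoredUrl a, ScoredUrl b) {
                    return Double.compare(b.score, a.score); // highest score first
                }
            });

    public void add(String url) {
        frontier.add(new ScoredUrl(url, scoreOf(url)));
    }

    public String next() {
        return frontier.poll().url; // the most promising URL is visited first
    }

    private double scoreOf(String url) {
        return 0.0; // placeholder for back-link count, partial PageRank, etc.
    }
}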

Constraints and Principles in Queuing

A crawler should foremost be polite to the server it is trying to reach. You will most likely crash or slow down the source if you make thousands of requests a minute. Queues should be ordered in a way that does not put extra load on any single server. In distributed hierarchies, queues should be re-sorted once work is distributed, to avoid unexpected concurrent requests to the same host. A per-host delay sketch follows.
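
A minimal Java sketch of the politeness part, allowing at most one request to a host within a fixed delay; real crawlers also honour robots.txt and per-host crawl-delay hints, and the delay value here is only illustrative:

import java.net.URI;
import java.util.HashMap;
import java.util.Map;

class PolitenessGate {
    private static final long DELAY_MILLIS = 2_000;   // illustrative value
    private final Map<String, Long> lastRequest = new HashMap<>();

    synchronized boolean mayFetch(String url, long nowMillis) {
        String host = URI.create(url).getHost();
        Long last = lastRequest.get(host);
        if (last != null && nowMillis - last < DELAY_MILLIS) {
            return false;                              // too soon: push the URL back into the queue
        }
        lastRequest.put(host, nowMillis);
        return true;
    }
}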

Another problem is the synchronization of distributed crawlers. If two separate crawling processes, crawler1 and crawler2, find the same link on different pages, they should communicate to avoid downloading and post-processing it twice.
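
One common arrangement (not necessarily the only one) is to partition the URL space so that every URL has exactly one owner, which removes most of the need for constant coordination. A minimal Java sketch, with an arbitrary crawler count:

// The owning crawler is chosen by hashing the host name.
class UrlPartitioner {
    private final int crawlerCount;

    UrlPartitioner(int crawlerCount) {
        this.crawlerCount = crawlerCount;
    }

    int ownerOf(String host) {
        return Math.floorMod(host.hashCode(), crawlerCount);
    }
}

// Usage: if crawler1 discovers a link whose ownerOf(host) is 2, it forwards the URL
// to crawler2 instead of fetching it itself.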

A large amount of data lies on the Deep Web – a term used for the invisible Web: documents with no direct links pointing to them from the visible Web. A download queue should be able to give acceptable priority to this set even when link-based statistical data is not available. As an example from practice, Google started using Sitemaps a few years ago to discover hidden content and to let authors assign relative priorities and expected change rates that help the crawlers.

Functional Programming for Beginners

Recently, I have been getting many questions about functional programming. Instead of answering everybody one by one, I decided to write a blog post about it. In this article, I'll try to introduce the FP concept. If you are interested, I advise you to get some hands-on experience. There are many widely used functional languages available today: LISP, Haskell, Erlang and F# (new but promising) are a few to name.

Firstly, a Brief History...

A long time ago, in the 1930s, when the world was stuck in another economic recession, the lives of four extraordinary mathematicians crossed in Princeton, NJ. These men were not interested in the physical world they lived in but were trying to create their own universe to find answers about the limits of computation – a phrase not yet heard by many. The area they were interested in was called formal systems, and their main problem was to determine which problems are solvable if processing power and memory were infinite. One of them was a true materialist, a slave of questioning and curiosity, a British man who decided to move to the New World after graduating from Trinity College. The second was a super brain whose Ph.D. dissertation was accepted when he was just 23 years old, nicknamed “Mr. Why”, a close friend of Albert Einstein. The other two were recent Princeton graduates who decided to go on to graduate school. Respectively, these men were Alan Turing, Kurt Gödel, Alonzo Church and Stephen Kleene. In 1936, Turing extended Gödel’s study on the limits of proof and computation by replacing Gödel's universal arithmetic-based formal language with formal devices called Turing machines. At the same time, the two young grad students, Church and Kleene, were designing a universal model of computation that was identical to Turing machines in power. Their formal system was called lambda calculus. Let’s say it in a clearer and less scientific-BS way: they invented a language, lambda calculus, that was capable of being the smallest universal programming language of the whole world.

Lambda Calculus

Lambda calculus is the common programming language of the world. The main aim of its inventors was to prove that any computable function can be expressed and evaluated using this formalization. In the universe of lambda calculus, the key elements are variables, abstractions and applications, where a variable is a name such as x, an abstraction λx.E defines an anonymous function with parameter x and body E, and an application E1 E2 applies one expression to another.

A name in lambda calculus cannot be associated with different values, therefore it is not really a “variable” in the imperative sense. Imagine your favourite imperative programming language not letting you change the values of variables by default. Yes, it sounds like a headache at first, but the whole concept stands on this rule. Now, let’s move on to a more practical example: a function that multiplies its input by 2.
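
As a minimal sketch, treating numbers and multiplication informally as built-ins rather than as Church encodings, that function and one evaluation of it look like this:

double = λx. (* 2 x)
(λx. (* 2 x)) 3  →  (* 2 3)  →  6

The first arrow is a beta reduction: the argument 3 is substituted for x in the body of the abstraction. The second arrow just evaluates the built-in multiplication.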

For great examples, I suggest you read “A Tutorial Introduction to the Lambda Calculus” by Rojas.

Functional Programming Language Primitives

Unsurprisingly, functional programming languages are artificial languages where the whole application is built from a function (or nested functions), a concept very similar to the lambda calculus. In this section I’m going to underline the most characteristic features (not restrictions) of functional languages.

  1. No sequences of discrete commands. Traditional programming languages are built around the idea of a variable as a changeable association between a name and values. These languages are said to be imperative because they consist of sequences of commands. Functional languages, on the other hand, are based on structured function calls: a functional program is an expression consisting of a function call which calls other functions in turn (nested function calls).

  2. Names may only be associated with a single value. In imperative languages, the value of a name (variable) can be modified. In functional languages, names are only introduced as the formal parameters of functions and given values by function calls with actual parameters. Once a formal parameter is associated with an actual parameter value, there is no way for it to be associated with a new value.

  3. No guaranteed execution order. Imperative languages execute line by line (if there are no multi-threaded pieces). In contrast, functional languages guarantee nothing about execution order; ordering has to be expressed through the structure of the expressions themselves, such as nesting one call inside another.

  4. No repetition of names. Because the same names cannot be reused with different values, nested function calls are used to create new versions of the names for new values. Similarly, because command repetition cannot be used to change the values associated with names, recursive function calls are used to repeatedly create new versions of names associated with new values, as sketched below.
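
A minimal Java illustration of that last point: summing a list with recursion instead of a loop, so no name is ever rebound to a new value. The names are illustrative only.

import java.util.List;

class RecursiveSum {
    // Each call binds fresh names (head, rest) instead of updating a loop counter
    // or an accumulator variable.
    static int sum(List<Integer> values) {
        if (values.isEmpty()) {
            return 0;
        }
        int head = values.get(0);
        List<Integer> rest = values.subList(1, values.size());
        return head + sum(rest);
    }

    public static void main(String[] args) {
        System.out.println(sum(List.of(1, 2, 3, 4)));   // prints 10
    }
}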

Advantages of Thinking Differently

So why do people need to think differently when we have plenty of rapid-development languages off the shelf and could ignore the marginal concepts FP brings? People interested in distributed computing already know part of the answer. Have you ever heard of mutual exclusion, race conditions or the other classic problems of distributed/parallel programming? Shared memory is a problem, and no matter how hard computer scientists work on locking shared resources during critical sections, the communication needed to synchronize these events often makes distributing the work not worth the effort. Consequently, engineers came up with ideas like MapReduce, where there are no dependencies between distributed tasks. MapReduce applies most of the functional concepts I presented above.
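
To give a tiny flavour of the same style in Java (this is plain Java streams, not Hadoop or Google's MapReduce): a pure mapping step followed by an associative reduction, with no shared mutable state to lock.

import java.util.List;

class MapReduceFlavour {
    public static void main(String[] args) {
        List<String> words = List.of("open", "science", "functional", "programming");

        // "map" each word to its length, then "reduce" the lengths with an
        // associative operation; nothing here mutates shared state, so the
        // steps could be distributed or parallelised freely.
        int totalLength = words.stream()
                               .map(String::length)
                               .reduce(0, Integer::sum);

        System.out.println(totalLength);   // prints 32
    }
}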

Secondly, functional programming is more than programming; it’s a way of thinking. You don’t need design patterns for functional programming because it is a design pattern in itself. With Java you have countless features to run the world, but when it comes to designing a software system, they invite shortcuts in your thinking. Functional designs, on the other hand, are deterministic.

There are many more advantages, such as in unit testing, deployment and updating software. Keep in mind that functional programming is a tough area to convey in a blog post; every section here could be extended by a few thousand words. I recommend you search, try, read, code and live it to get the complete picture.

Continuous Scrolling from a User POV

It was a few months ago when I first saw the continuous scrolling template in Yahoo’s UI Design Patterns library. It is also called infinite scrolling or infinite page, but no matter how fancy the name, the rating in the library alone explained what a user-hostile navigation method it is. At that moment, the page gave me the instant feeling that some random guy had posted a pattern without sense or taste, because that behaviour was absolutely unexpected and against common user-experience practice. Unfortunately, I was wrong! A few weeks ago I came across a Live Search feature showcase on Channel 9 discussing the new infinite scrolling in image search. Wow! Totally without prejudice, I was curious and wanted to experience it. (Comforting myself with an in-mind quote: anyway, who’s using Live Search? Give the guys the freedom to try things.)

I searched for “Jef Raskin”. The result set contained more than 40 pictures, I suppose, so the initial view was already very crowded, and before I hit the bottom another 40 or so loaded. Everything started to get a little confusing and the page looked messy. There were no layers or separators in between, which made it harder to scroll back to an image I had previously liked. After 5-6 loads it became impossible to navigate between the older images and the newly loaded sets. At least the human brain handles image data relatively well; hashtag feeds use the same approach, and with free-form text it is even scarier.

Additionally, how am I supposed to point friends to the right page? What about sharing?

Me: Hey, look, your old pictures are popping up under a totally irrelevant topic. Type in [phrase here], scroll down 3 times, then scroll up approximately 200 pixels, stop, no sorry, scroll down a little bit, scan the heap. There you are. (Instead of sending a link to the 4th page.) My friend: Okeeeeeyyyyyyy.

Finally, my point: if you can’t find anything new that actually improves the navigation experience, that doesn’t mean you have to mess it up like this. I have seen 4-5 other examples lately; is this becoming the new trend? NO!! I want the good old paging back :’) What’s your feedback?

Edit: I noticed the 8px text above that informs you about the current range, what a tricky improvement, huh? Also, images stop loading after you hit 1000. Did they think nobody would insist on scrolling down that far?

BigTable Concept: Why do the World's Smartest People Ignore Relational DBs?

In the era of the Internet, the key problem is scalability. As the cloud’s popularity climbs, we hear more about its constraints. So far I have only had time to play with Google’s App Engine and Microsoft’s Azure Services Platform. Cloud developers are mainly shocked by the non-relational databases these services offer as the only alternative. Google calls it BigTable, and Microsoft found a new place in its own terminology dictionary for BLOB. Many start to wonder what the hype about relational databases over the past 30 years was about. Foremost, let’s be clear that this is not a replacement but a more efficient way to store data, achieved by eliminating the not-that-fundamental, over-engineered functionality layers of current relational database management systems. Yes, good news for people who make a living designing super large and highly normalized databases to ensure data integrity.

In a relational database, everything is under control: you can add constraints to ensure nobody can insert a duplicate row, or program the DBMS to clean up useless orphan rows on deletion. Best of all, a relational DBMS will pre-process your SQL query before executing it to avoid the silly performance mistakes you might make. Now think of the environment: constraints upon constraints, query execution strategies, a high level of interdependence and complex indexing methods. This package works great until you want to distribute the tables across different machines. Can you imagine joining two tables that are distributed over 100,000 nodes? For Google this is an everyday problem (or better, an every-millisecond issue). Luckily, Google’s data has particular characteristics; according to Jeffrey Dean, they are able to handle at the application level the constraints a DBMS would normally enforce on the data. Consequently, Google keeps data in a very basic form, as tuples.

BigTable looks like a very large B+ tree with three levels of hierarchy. All tables are sorted, and tables are split into pieces called tablets. The first two levels are metadata tables that locate the right tablet for you. The root tablet is not distributed, but with the help of prefetching and aggressive caching it is not actually the bottleneck of the system. The final level of tablets points to physical files managed by the Google File System (GFS). GFS keeps 3 copies of each file, so even if a machine goes down there are still 2 other copies somewhere else. The second figure illustrates a row of a tablet: com.cnn.www is the key, and the value has three different columns: contents, anchor:cnnsi.com and anchor:my.look.ca. Notice the timestamps; these fields may contain more than one version of an entry. Whenever the Google crawler finds updated content on www.cnn.com, a new layer is added. This is what gives BigTable its three-dimensional data presentation, roughly sketched below.
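
A rough sketch of that data model (not of BigTable's actual storage format) as a nested sorted map in Java: row key to column name to timestamp to value, with the newest version first.

import java.util.NavigableMap;
import java.util.TreeMap;

class TinyBigTableModel {
    // row key -> column name -> timestamp -> value; timestamps are kept in
    // descending order so the newest version comes first.
    private final NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> rows =
            new TreeMap<>();

    void put(String rowKey, String column, long timestamp, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>())
            .computeIfAbsent(column, k -> new TreeMap<Long, String>(java.util.Comparator.reverseOrder()))
            .put(timestamp, value);
    }

    String latest(String rowKey, String column) {
        NavigableMap<String, NavigableMap<Long, String>> row = rows.get(rowKey);
        if (row == null) return null;
        NavigableMap<Long, String> versions = row.get(column);
        return versions == null || versions.isEmpty() ? null : versions.firstEntry().getValue();
    }

    public static void main(String[] args) {
        TinyBigTableModel t = new TinyBigTableModel();
        t.put("com.cnn.www", "anchor:cnnsi.com", 1L, "CNN");
        t.put("com.cnn.www", "contents", 1L, "<html>old</html>");
        t.put("com.cnn.www", "contents", 2L, "<html>new</html>");
        System.out.println(t.latest("com.cnn.www", "contents"));   // prints <html>new</html>
    }
}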

At the end of the day, BigTable is not rocket science. It is compact, easy to adopt and very straightforward. Many friends know that I came up with a very similar concept while designing Rootapi two years ago, back when I hadn't yet heard of BigTable. In my case I stored the values as JSON (an equality operation was enough for querying) in blocks that were multiples of the sector size of my physical hard drives. IO operations were super fast, the JSON-based web services were super fast, and it was highly distributable, although I never had a good environment to explore the severe situations deeply.

As we move to the cloud, this is how we are going to look at data storage. If you want more technical detail, I highly recommend the following references:

  1. BigTable: A Distributed Structured Storage System
  2. Bigtable: A Distributed Storage System for Structured Data – the original BigTable paper, which appeared at OSDI '06.
  3. Google File System – an introduction to GFS by Aaron Kimball.

An Unusual Proclamation: Partial Classes

When it comes to the myth behind partial classes (which came along with C# 2.0), what is usually said is:

  1. Partial classes are great when multiple developers need to work on the same class, or

  2. They help you separate generated code from your own code. A typical example is Win Forms applications.

Both reasons are true, but to my mind many overlook the real secret behind partitioning a class. A few weeks ago I started coding a distributed file system (a virtual one that relies heavily on existing Windows features) called WinDFS, and had to implement a message transmission engine over TCP/IP for communication among nodes. I decided to build it on remote objects, so I created a serializable Transferable class that has a FileInfo member. Both WinDFSClient and WinDFSMaster reference the Remoting project for transmission operations. The straightforward flow is this: a client asks for the metadata of a file or directory by sending the corresponding command with the entity's path and an empty FileInfo element to the master machine; the master searches the storage engine, fills the FileInfo element with the related data and sends it back to the requesting node.

Everything is fine up to this point. But what would you do if you wanted to do more with FileInfo specifically on the client side, such as outputting the attributes of a FileInfo instance as XML? Would it make sense to create an XmlOutputFromFileInfoBlahBlah class and make the solution look like garbage? No, as long as we are not Christopher Millis.

The most natural approach is to extend FileInfo specifically for WinDFSClient; note that the best resolution has nothing to do with extending the original FileInfo inside Remoting, since XML output most probably has nothing in common with the other projects. On the other hand, XML output sounds like a stand-alone feature, but what would we do if we wanted to extend FileInfo with resources that only WinDFSMaster uses?

At this point, partial classes are a great design choice; at least they worked for me. I created a class named FileInfo in WinDFSMaster, added the partial keyword to both class signatures, implemented ToXml on the master side, and it now works like a dream. Masters can call FileInfo.ToXml() while the method stays totally isolated from the other projects. Note that you need access to the source code of the original class to take advantage of this method.