March 2011 ~ Top Site Map

Crowdsourcing goes professional: The rise of the verticals

Posted by Internet at Every Where on 12:25 PM

Over the last few months, I have noticed a trend. Instead of letting end-users interact directly with the crowd (e.g., on Mechanical Turk), we see a rise in the number of solutions that each target a very specific vertical.

CastingWords: Audio transcription
SpeakerText: Video transcription
Serv.io Translate: Translation (by CloudCrowd)
Serv.io Edit: Proofreading (by CloudCrowd)
CrowdFlower Business Listing Verification
CrowdFlower Search Relevance
CrowdFlower Product Categorization
BetterOCR: Improving Optical Character Recognition
Tagasauris: Photo tagging
MediaPiston: Targeted content generation

Add services like Trada for crowd-optimizing paid advertising campaigns, uTest for crowd-testing software applications, etc. and you will see that for most crowd applications there is now a professionally developed crowd-app.

Why do we see these efforts? This is the point at which most people realize that crowdsourcing is not that simple. Using Mechanical Turk directly is a very costly enterprise and cannot be done effectively by amateurs: The interface needs to be professionally designed, quality control needs to be done intelligently, and the crowd needs to be managed in the same way that any employee is managed. Most companies do not have the time or the resources to invest in such solutions. So, we see the rise of verticals that address the most common tasks that were accomplished on Mechanical Turk.

(Interestingly enough, if I remember correctly, the rise of vertical solutions was also a phase during web search. In the period in which AltaVista started being spammed and full of irrelevant results, we saw the rise of topic-specific search engines that were trying to eliminate the problems of polysemy by letting you search only for web pages within a given topic.)

For me, this is the signal that crowdsourcing will stop being the fad of the day. Amateurish solutions will be shunned, and most people will find it cheaper to just use the services of the verticals above. Saying "oh, I paid just $[add offensively low dollar amount] to do [add trivial task] on Mechanical Turk" will stop being a novelty, and people will just point to a company that does the same thing professionally and at a large scale.

This also means that the crowdsourcing space will become increasingly "boring." All the low-hanging fruit will be gone. Only people who are willing to invest time and effort in the long term will get into the space.

And it will be the time that we will get to separate the wheat from the chaff.

Uncovering an advertising fraud scheme. Or "the Internet is for porn"

Posted by Internet at Every Where on 10:50 AM


Do Mechanical Turk workers lie about their location?

Posted by Internet at Every Where on 4:41 PM

A few weeks back, Dahn Tamir graciously allowed me to take a peek at the data that he has been gathering about his workers on Mechanical Turk. He has assigned tasks over time to more than 50,000 workers on Mechanical Turk, so I consider his data to be one of the most representative samples of workers.

One of the nice tasks that he has been running is a simple HIT in which he asks workers to report their location. At the same time, in this task, Dahn was recording the IP of the worker. Why was the task nice? Because there is absolutely no incentive for the workers to be truthful. The submission will be accepted and paid no matter what. In a sense, it is a test that checks whether workers will be truthful in cases where it is not possible to check their accuracy.

So, we used this test to check how sincere the workers are: We can simply geocode the IP address and find out the actual location of the worker. (With some degree of error, but good enough for approximation purposes.) For the workers that reported to be based in the US (approximately 22,000 workers), the HIT was asking for the zip code of the worker, making it easy to assign an approximate latitude/longitude location.

To measure how accurately the workers report their location, we measured the distance between the location of the IP and the location of the zip code. The plot below shows the distribution of the differences:

As you can see, most of the workers were pretty truthful about their location. The difference in distance was less than 10 miles for more than 60% of the workers: this difference can be easily explained by the limited accuracy of the geocoding APIs and by the approximation of using zip code locations.

Of course, the flip side of the coin is that a significant fraction of the workers were essentially lying about their location: For 10% of the workers (i.e., ~2250 of them) the IP address was more than 100 miles away from the reported zip code. For 2% of the workers (i.e., ~500 workers) the distance was more than 1000 miles away.

The biggest liar? A worker from Chennai, India who reported a zip code corresponding to Tampa in Florida. The IP was a cool 9500 miles away from the reported location!
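For readers who want to try a similar check, here is a minimal sketch of the distance measurement. It is not the code we used for the analysis. It assumes you already have a latitude/longitude pair for the geocoded IP and for the zip code, and it uses the haversine great-circle formula as one reasonable choice of distance. The function names, the Earth radius constant, and the approximate city-center coordinates for Chennai and Tampa are all my own illustrative assumptions, not values from Dahn's data.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius (assumed constant)


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    # Clamp to 1.0 to guard against floating-point overshoot near antipodes.
    return 2 * EARTH_RADIUS_MILES * asin(min(1.0, sqrt(a)))


def share_beyond(distances, threshold_miles):
    """Fraction of workers whose IP is more than threshold_miles from the zip code."""
    if not distances:
        return 0.0
    return sum(d > threshold_miles for d in distances) / len(distances)


# Approximate city-center coordinates, assumed for illustration only.
chennai = (13.08, 80.27)
tampa = (27.95, -82.46)

if __name__ == "__main__":
    print(round(haversine_miles(*chennai, *tampa)), "miles from Chennai to Tampa")
    # Hypothetical per-worker distances, just to show the threshold helper.
    sample = [2.0, 8.5, 40.0, 150.0, 1200.0]
    for threshold in (10, 100, 1000):
        print(threshold, share_beyond(sample, threshold))
```

With city-center coordinates the Chennai to Tampa distance comes out in the same ballpark as the figure quoted above; the exact number depends on where the geocoder places the IP and the zip code.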

The Road to Serfdom, ACM Edition

Posted by Internet at Every Where on 10:01 PM

<rant>

A couple of days back, I got the following email from ACM:

Dear Moderator/Chairs,

This is being sent to everyone with the chairs cc'd as the last and final requeset for the eform below to be completed or your panel overview abstract will be removed from the WWW 2011 Companion Publication and will NOT appear in the ACM DL.

Your prompt and immediate attention to the form below is needed.

permission release form URL: ....

ACM Copyrights & Permissions

Given that this was the "last and final requeset"[sic], I assumed that somehow I missed the previous requests. So, I checked my email to find out how late I was. Nope. Nothing in the archive, nothing in the trash, nothing in the spam, no entry in the delivery log. This was the first notification sent by ACM. They had simply forgotten about this. But since they were running late, why not just threaten the authors? It is so much easier to pass the blame to others and be the first one to be aggressive.

What happened, ACM? Did you start getting advice on customer service from your pals at Sheridan Printing, who tend to send requests like this?

But I should not have been so surprised. This email just reflects the overall attitude of ACM. I have experienced this many times in the past. Anyway, I decided to sign the e-form, without firing back.

Donating copyright to ACM

Signing the form used to be a mechanical action. However, after reading Matt Blaze's post on copyright and academic publishing, I decided to read the form a little bit more carefully, to see exactly what I was signing.

As usual, we start with a transfer of copyright to ACM. The authors agree to transfer all their copyright rights to ACM, blah blah...

Wait a minute! Why does ACM need to own the copyright? No good reason. To publish and distribute the article, ACM just needs a non-exclusive license to print and distribute. There is no need to own the copyright.

If we follow ACM's logic, any artist who wants to see their work exhibited in a museum needs to give up ownership of the work and hand full ownership of their creations to the museum. For free. Without expecting any royalties in return. Ever. Furthermore, instead of promoting the work, the museum would lock it in a "patron members access only" room. For all others, the museum would demand a separate entrance ticket to show each of the collection pieces. (Say, for a friendly price of $5 to see each painting?)

Anyway, let's not belabor the point with copyright. We know that ACM's policy sucks. We know that ACM is a bureaucracy serving just itself and not its members or the profession. Let's move on.

Let's move to the point that really got me fired up.

Protecting ACM from liability

What got me really pissed was the last part of the agreement:

Liability Waiver

* Your grant of permission is conditional upon you agreeing to the terms set out below.

I hereby release and discharge ACM and other publication sponsors and organizers from any and all liability arising out of my inclusion in the publication, or in connection with the performance of any of the activities described in this document as permitted herein. This includes, but is not limited to, my right of privacy or publicity, copyright, patent rights, trade secret rights, moral rights or trademark rights.

All permissions and releases granted by me herein shall be effective in perpetuity unless otherwise stipulated, and extend and apply to the ACM and its assigns, contractors, sublicensed distributors, successors and agents.

So, not only should we "voluntarily" donate ownership of our copyright to ACM. We also need to protect ACM from any liability.

In other words, ACM wants to get all the upside from owning the copyright, without ever distributing royalties to the contributing authors. (Not that it would be worth much. It is a matter of principle and a signal of respect to the authors, not an issue of monetary importance.) At the same time, ACM also wants the authors to provide a guarantee that if there is any problem with the copyright, the author will be the one liable for the damages.

All the upside for ACM, no revenue to the authors. All the downside to the authors, no obligations for ACM.

Thank you ACM for caring so much about your members. You will not be missed when you disappear.

Yours truly,
A lifetime member of ACM.

PS: In retrospect, the title of the post is offensive: From Wikipedia's definition of serfdom: "Serfdom included the forced labor of serfs bound to a hereditary plot of land owned by a lord in return for protection". In other words, the slave owners took the product of slaves' work, but in return they provided the protection and military support, to defend the slaves that were working the land. ACM also wants the slaves to "protect the land" as well. I owe an apology to the slave owners for the comparison.

</rant>

The promise and fear of an assembly line for knowledge work

Posted by Internet at Every Where on 11:41 AM

Last week, Amanda Michel from ProPublica and I presented at the CAR 2011 conference (CAR stands for Computer-Assisted Reporting) on how to best use Mechanical Turk for a variety of tasks pertaining to data-driven journalism.

We discussed issues of quality assurance, how TurKit-like workflow-based tasks can generate nice outcomes, and briefly touched upon the CrowdForge work from Niki Kittur and the team at CMU, showing that crowdsourcing can potentially generate intellectual outcomes comparable to those of trained humans.

The discussion after the session was a mix of excitement and fear. We have observed in the past how "assembly line" work for industrial production led to massive productivity improvements and was the basis for much of the progress in the 19th and 20th centuries. But that was for mechanical work. Yes, it replaced the centuries-old crafts of the blacksmiths, carpenters, and potters, but that was just part of progress.

What happens if we see now the assembly line extended into tasks that were traditionally considered creative and intellectual in nature? What would be the effect of an assembly line for knowledge work?

A few months back, I quoted Marx and Engels who, back in 1848, wrote in their Communist Manifesto:

the work of the proletarians has lost all individual character, and, consequently, all charm for the workman. ... [The workman] becomes an appendage of the machine, and it is only the most simple, most monotonous, and most easily acquired knack, that is required of him

(Btw, TIME magazine liked that connection enough to put it into their own article about Mechanical Turk.)

But how likely is it that this style of work will be extended further into the intellectual field? Are these Mechanical Turk experiments something generalizable, or just cute proof-of-concept experiments?

I was reminded of this question today, when I realized that many intellectual tasks are already commoditized:

The article "Inside the multimillion-dollar essay-scoring business: Behind the scenes of standardized testing" gives a dreadful view of how essays are being scored for the standardized tests.

Based on the description of the article, the (human-based) scoring process "goes too fast; relies on cheap, inexperienced labor; and does not accurately assess student learning." Needless to say, the workers were not exactly enthusiastic about their work. Match that with the computer-assisted scoring of essays, and you have an MTurk-like environment for much more intellectually-demanding tasks...

After reading this essay-scoring mill story, I started feeling a little bit uneasy. The MTurk-style work seems too far away to be in my future, so the discussion is always, ahem, academic. But the essay scoring brought the concept a little bit too close for comfort.
