Lexicalist

Screen capture of part of the Lexicalist interface. Keyword search = privacy.
ABOUT. Lexicalist reads through millions of words of chatter on the internet to analyze how certain demographics talk and what kinds of things they talk about. We currently break this information down into three kinds of demographics: age, gender, and geography.
METHODOLOGY. Lexicalist works by analyzing rich sources of information online, including blog posts, news sources, and social networking sites like Twitter. Each bit of information is subjected to rigorous natural language processing, which includes a likelihood distribution of being authored over all geographic, age and gender demographics.
All of the statistical results displayed here are then normalized against the volume of information coming from each demographic to see what words are most commonly associated with certain populations. The result is a descriptive snapshot of language as it’s used today.
src
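To make the normalization step a little more concrete, here is a minimal sketch in Python. It is not Lexicalist’s actual method, which the site does not spell out; the group labels, the counts, and the per-10,000-words scale are all invented for illustration. The only idea it shows is that dividing a word’s count in each demographic by that demographic’s total volume gives rates that can be compared across groups of very different sizes.

```python
def normalized_rates(word_counts, total_words, per=10_000):
    """Return each group's rate of use of a word per `per` words written.

    word_counts: group -> how often the word appeared in that group's text
    total_words: group -> total number of words collected from that group
    """
    return {
        group: per * word_counts.get(group, 0) / total_words[group]
        for group in total_words
    }


# Hypothetical numbers: the younger group uses "privacy" more often in raw
# terms, but it also produces three times as much text overall.
privacy_counts = {"under_25": 300, "over_25": 200}
total_words = {"under_25": 3_000_000, "over_25": 1_000_000}

rates = normalized_rates(privacy_counts, total_words)
for group, rate in rates.items():
    print(group, rate)
```

With these made-up figures the raw counts favor the younger group, while the normalized rates favor the older one, which is presumably the kind of distortion the normalization is meant to remove.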
Lexicalist is worth investigating. In fact, it is potentially a time sink for the lexically minded. I wish I knew how a descriptive snapshot relates to language usage. This seems to me to be more a question of the relation between a particular form of methodical description and some particular frame for usage. Presumably there is a relation borne by natural language processing.
Okay, over at Wikipedia, the treatment of NLP includes:
Tasks and limitations
In theory, natural-language processing is a very attractive method of human-computer interaction. Early systems, such as SHRDLU, working in restricted “blocks worlds” with restricted vocabularies, worked extremely well, leading researchers to excessive optimism, which was soon lost when the systems were extended to more realistic situations with real-world ambiguity and complexity.
Natural-language understanding is sometimes referred to as an AI-complete problem, because natural-language recognition seems to require extensive knowledge about the outside world and the ability to manipulate it. The definition of “understanding” is one of the major problems in natural-language processing. src
‘Context is to understanding, . . ‘
(I added Lexicalist to our links.)

Old and New Net
Web 3.0 from Kate Ray on Vimeo.
This video from Kate Ray quickly made the rounds.
What a long way the net has come. I suppose it is necessary, if gratuitous, to add: ‘for better and for worse.’
There’s a moment in this interesting mash-up where the speaker implies the following: could we re-render the human brain to think more like a machine? This follows from the difficulty of making a machine think like a human.
I had to look up the use of the term ontologies because I know little about information science, and its use in the video seemed to depart from the philosophical term. Here’s the treatment of ontologies at Wikipedia.
There is nothing about the problems faced by the varieties of user. I’m a user and I know the problems I encounter in searching for information on the internet, in libraries, and on my own computer, in my own archive of documents.
I’ll mention three challenges. I’ll frame this by stating that I wish my computer-based archives and library archives were indexed by google.
(1) Usually, (my) searches for information on google are satisfied. However, because the results are matched only against the real-time indexing my own cognition provides, the end of a search on a given topic–usually in the social sciences–is arbitrarily terminated. In other words, I have no conclusive idea that a given result is the optimum result. I’d also characterize my search methods as using partly ad hoc heuristics.
(2) Searches in my computer-based archive are brute force and leverage Spotlight’s ability to look into the text of every file, BUT they involve scanning through very long result lists, most of which are not positive. As a user, I find the labor-intensive task of organizing files on my end is ‘too much.’ And compounding this is the ease with which information can be archived versus the labor involved in organizing it. Somewhat: the intuitive’s curse…
(3) The most difficult searches of web and internet resources are those that are very particular and very local. A good example would be somebody’s address. Searches oriented to topics do not fall into this category.
One other note–I would guess my own search capability falls into the highly capable slice of any bell curve. This guess is based on my understanding of how to use the specific editing features of google search. And, it’s based on observing how most other people use search. One of the challenges for the semantic web, given,
The Semantic Web is an evolving development of the World Wide Web in which the meaning (semantics) of information on the web is defined, making it possible for machines to process it.
is that any useful, more powerful interface and facilitation has to meet the different modes of differentiated users.
For example, I wouldn’t be skeptical of a machine’s ability to qualify results so that I could be confident I’ve reached the optimum set of results, but I’d like to know beforehand why I needn’t be skeptical. And, this would have to be presented to me at my level.
new book announcement plus related blog
Title: The Discourse of Blogs and Wikis
Series Title: Continuum Discourse Series
Publication Year: 2009
Publisher: Continuum International Publishing Group Ltd
http://www.continuumbooks.com
Book URL: http://www.continuumbooks.com/books/detail.aspx?BookId=132398&SearchType=Basic
Editor: Greg Myers
Hardback: ISBN: 9781847064134 Pages: 192 Price: U.S. $ 150.00
Hardback: ISBN: 9781847064134 Pages: 192 Price: U.K. £ 75.00
Paperback: ISBN: 9781847064141 Pages: 192 Price: U.K. £ 24.99
Paperback: ISBN: 9781847064141 Pages: 192 Price: U.S. $ 44.99
Abstract:
Blogs and Wikis have not been with us for long, but have made a huge impact
on society. Wikipedia is the best known exemplar of the wiki, a
collaborative site that leads to a single text claimed by no-one; blogs, or
web-logs, have exploded into the mainstream through novelisations, film
adaptations and have gathered huge followings. Blogs and wikis also serve
to provide a coherent basis for a discourse analysis of specific web
language.
What makes these forms distinctive as genres, and what ramifications does
the technology have on the language? Myers looks at how blogs and wikis:
*allow for easier than ever publication
*can claim to challenge institutional hierarchies
*provide alternate perspectives on events
*exemplify globalization
*challenge demarcations between the personal and the public
*construct new communities and more
Drawing on a wide range of popular blogs and wikis, the book works
alongside an author blog – http://thelanguageofblogs.typepad.com/ – that
contains regularly updated links, references and a glossary. An essential
textbook for upper level undergraduates on linguistics and language studies
courses, it elucidates, informs and offers insights into a major new type
of discourse. This coursebook includes a companion website for student and
lecturer use.
it’s the blog on “the language of blogs” which appears to be a very good resource, with a lot of links to recent work on blog research, other blogs related to online research, and posts of relevance to our own interest. i think i might need to comment on some of those posts….
All That Will Be Left Is Language

Is Technology Dumbing Down Japanese?
Emily Parker, New York Times, November 5, 2009 | src
excerpt:
Now the Japanese language is being transformed by blogs, e-mail and keitai shosetsu, or cellphone novels. Americans may fret over the ways digital communications encourage sloppy grammar and spelling, but in Japan these changes are much more wrenching. A vertically written language seems to be becoming increasingly horizontal. Novels are being written and read on little screens. People have gotten so used to typing on computers that they can no longer write characters by hand. And English words continue to infiltrate the language.
conference abstract: graphic visualisation
MODELLING AND VISUALISING DISCOURSE PATTERNS
Bandar Almutairi, University of Sydney, Australia;
Michele Zappavigna, University of Sydney, Australia.
Texts can be intractable. As discourse analysts, we are limited by the extent to which our perceptual systems can detect long-range and complex patterns in discourse, even where we have manually annotated the data. Since a text is more than a bag of words, clauses or any other structure (Martin, 1985) we need technology that can assist the analyst in achieving both a synoptic and dynamic perspective on their text analyses. This paper develops a text visualisation strategy that leverages periodicity, how information is organised as a text unfolds (Halliday, 1985; Martin & Rose, 2007). Since periodicity is “concerned with information flow – with the way in which meanings are packaged to make it easier for us to take them in” (Martin & Rose, 2007: 188), we argue that the intangible time of a text can be measured by a complex unit based on this concept. We use mathematical interpolation to produce representations of waves of periodicity that can be used as a time reference helping us to visualise the distribution of other linguistic systems (e.g. Appraisal, Process-Type etc.) throughout the text. We use this method to detect patterns in these features in terms of their relative distance from the peaks of the waves. The method can be used recursively (e.g. nested functions; functions of functions) to create waves of waves corresponding to patterns of patterns at the same stratum or generalized to include components from higher or lower strata in language. We apply the method in a pilot study to compare the unfolding of prosodies of evaluative meaning in two texts annotated using Appraisal Theory (Martin & White, 2005). A long term aim of this project is to develop a metalanguage, as Zhao (forthcoming) has suggested, for describing the kinds of logogenetic patterns, in other words, patterns of unfolding meaning, that are possible in texts.
their work is very much related to our ongoing interest here, in visualising the dynamics of interaction. data visualisation. interaction is mediated through writing, or recorded in writing using a transcript of spoken conversation, or having a video text of a multi-modal event. such a transcript can then be analysed using any number of approaches or frameworks. the next step is to create a ‘transformation’ of that analysis into a diagram which represents the interaction according to whatever elements or figures we are interested in examining – and then further transforming at will those elements cross-referenced by other elements or figures to reveal correlations and new figures that are not immediately obvious from raw analysis alone.
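To get a feel for the idea in the abstract, here is a rough sketch in Python of using a ‘wave’ as a time reference. It is emphatically not the authors’ method, which the abstract does not detail: the wave values, the clause indices, and the choice of plain piecewise-linear interpolation are all assumptions made for illustration. It interpolates a prominence value between hand-marked peaks and troughs, then reports where some annotated clauses sit relative to the nearest peak.

```python
from bisect import bisect_left


def interpolate(points, x):
    """Piecewise-linear interpolation through sorted (position, value) points."""
    xs = [pos for pos, _ in points]
    if x <= xs[0]:
        return points[0][1]
    if x >= xs[-1]:
        return points[-1][1]
    i = bisect_left(xs, x)  # first point at or to the right of x
    (x0, y0), (x1, y1) = points[i - 1], points[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def distance_to_nearest_peak(peaks, x):
    """Distance, in clauses, from position x to the closest peak."""
    return min(abs(x - p) for p in peaks)


# Hypothetical wave: clause index -> prominence (1.0 = peak, 0.0 = trough).
wave = [(0, 1.0), (5, 0.0), (10, 1.0), (15, 0.0), (20, 1.0)]
peaks = [pos for pos, value in wave if value == 1.0]

# Hypothetical clause indices that an analyst tagged as evaluative.
appraisal_clauses = [1, 9, 12, 19]

profile = [
    (c, interpolate(wave, c), distance_to_nearest_peak(peaks, c))
    for c in appraisal_clauses
]
for clause, height, distance in profile:
    print(clause, round(height, 2), distance)
```

A list like `profile` is the sort of thing that could then be plotted, which is the ‘transformation’ step described above; whether evaluative items really cluster near peaks is an empirical question this toy data says nothing about.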
The Prediction of Desire
Mining the Web for Feelings, Not Facts
New York Times
By ALEX WRIGHT
Published: August 23, 2009
(excerpts)
1.
Computers may be good at crunching numbers, but can they crunch feelings?
The rise of blogs and social networks has fueled a bull market in personal opinion: reviews, ratings, recommendations and other forms of online expression. For computer scientists, this fast-growing mountain of data is opening a tantalizing window onto the collective consciousness of Internet users.
An emerging field known as sentiment analysis is taking shape around one of the computer world’s unexplored frontiers: translating the vagaries of human emotion into hard data.
This is more than just an interesting programming exercise. For many businesses, online opinion has turned into a kind of virtual currency that can make or break a product in the marketplace.
2.
Jodange, based in Yonkers, offers a service geared toward online publishers that lets them incorporate opinion data drawn from over 450,000 sources, including mainstream news sources, blogs and Twitter.
4.
Such tools could help companies pinpoint the effect of specific issues on customer perceptions, helping them respond with appropriate marketing and public relations strategies.
5.
While the more advanced algorithms used by Scout Labs, Jodange and Newssift employ advanced analytics to avoid such pitfalls, none of these services works perfectly. “Our algorithm is about 70 to 80 percent accurate,” said Ms. Francis, who added that its users can reclassify inaccurate results so the system learns from its mistakes.
Translating the slippery stuff of human language into binary values will always be an imperfect science, however. “Sentiments are very different from conventional facts,” said Seth Grimes, the founder of the suburban Maryland consulting firm Alta Plana, who points to the many cultural factors and linguistic nuances that make it difficult to turn a string of written text into a simple pro or con sentiment. “ ‘Sinful’ is a good thing when applied to chocolate cake,” he said.
The simplest algorithms work by scanning keywords to categorize a statement as positive or negative, based on a simple binary analysis (“love” is good, “hate” is bad). But that approach fails to capture the subtleties that bring human language to life: irony, sarcasm, slang and other idiomatic expressions. Reliable sentiment analysis requires parsing many linguistic shades of gray.
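(An aside from me, not from the article: the keyword approach described in that paragraph can be sketched in a few lines of Python. The word lists below are a made-up mini-lexicon, not taken from any of the services mentioned, and the function name is my own.)

```python
import re

# Hypothetical mini-lexicon, invented for illustration.
POSITIVE = {"love", "great", "good"}
NEGATIVE = {"hate", "awful", "bad", "sinful"}


def keyword_polarity(text):
    """Label text by counting positive and negative keywords."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(keyword_polarity("I love this phone"))
print(keyword_polarity("This chocolate cake is sinful"))
print(keyword_polarity("Oh great, another delay"))
```

(The second and third sentences are the interesting ones: the cake is marked down for being ‘sinful’ and the sarcastic complaint is marked up for ‘great’, which is exactly the failure the article goes on to describe.)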
“We are dealing with sentiment that can be expressed in subtle ways,” said Bo Pang, a researcher at Yahoo who co-wrote “Opinion Mining and Sentiment Analysis,” one of the first academic books on sentiment analysis.
To get at the true intent of a statement, Ms. Pang developed software that looks at several different filters, including polarity (is the statement positive or negative?), intensity (what is the degree of emotion being expressed?) and subjectivity (how partial or impartial is the source?).
For example, a preponderance of adjectives often signals a high degree of subjectivity, while noun- and verb-heavy statements tend toward a more neutral point of view.
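(Another aside from me: the adjective heuristic can be sketched the same way. A real system would need a part-of-speech tagger; the small adjective set below is a hypothetical stand-in for one, and nothing here reflects the software the article describes.)

```python
import re

# Hypothetical stand-in for a part-of-speech tagger: a tiny adjective list.
ADJECTIVES = {"amazing", "terrible", "gorgeous", "slow", "lovely", "awful"}


def adjective_ratio(text):
    """Return the share of words in text that are known adjectives."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    return sum(w in ADJECTIVES for w in words) / len(words)


opinion = "The gorgeous screen is amazing"
report = "The company shipped the phone in June"
print(adjective_ratio(opinion), adjective_ratio(report))
```

(On this crude measure the first sentence scores higher than the second, in line with the claim that adjective-heavy statements tend to be the more subjective ones.)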
As sentiment analysis algorithms grow more sophisticated, they should begin to yield more accurate results that may eventually point the way to more sophisticated filtering mechanisms. They could become a part of everyday Web use.
Code-swarm, anyone?

