As you like it

Already sick of the sweetness of all the heart-shaped confectionery, pink hot chocolates, and simultaneously both heart-shaped and pink lingerie? Can’t take another rendition of Puppy Love? Or just want a break from figuring out a larger-than-life Valentine’s gift with less than a week to go? Fear not, even with Valentine’s Day coming up, I am not here to tell you how to express your affection in a dozen languages nor to provide you with an etymology of the word love; I am here to water down love to a variety of like which does not even mean, like, like.

Love is all around me - and so is like. Peggy2012CREATIVELENZ.

Now liking something is one thing – to like, as emotionally complex as it may be, is grammatically relatively straightforward. You use the verb like when you like banoffee pie, George Clooney, or analyzing linguistic data, and even without any knowledge of linguistics you will have a fairly clear idea of what sort of an element this like is (a verb, in case your intuition failed you). But when you strip away the meaning of like, you are suddenly looking at a homophone with a range of functions that it is much harder to put your finger on.

Just think about like in all these instances:

  1. George Clooney acts like a true Hollywood star.
  2. He has won an Oscar like all good actors should.
  3. The wax sculpture of George Clooney at Madame Tussauds is so life-like that many a fan has been fooled.
  4. I mean, there were fans of him like fainting.
  5. So I was like, come on, it’s not the real deal.

To capture these uses of like, like in this sense (as opposed to the verb like) has been dubbed a ‘particle’. This is something of a dustbin category, filled with a wide array of words that do not have their own lexical definition – think of words relating to grammatical categories, such as negation (not), or those little things that you utter to connect and organize different bits of what you are saying or to express attitude (known as discourse markers – well, anyway, firstly). To continue the name-dropping, in the first sentence like is used as a preposition, in the second it is a conjunction, then a suffix, and finally a non-quotative and quotative complementizer. Sometimes you wish all words could be like (preposition here) the verb like.

"I was like OMG it's George Clooney like taking a selfie like a normal person!"

By far the most controversial use of like is the final, quotative complementizer one (sentence 5), the structure be like. I was recently at a dinner where an elderly professor, upon hearing that I did linguistics, launched into a rant about how his grandchildren’s speech was dotted with atrocities such as “I was like that’s so cool.”

As with so many linguistic things that people like (!) to condemn (see Chris’s post about code switching from last week, or my earlier blog about pronouns), be like is not a result of an evil mastermind corrupting language, or even adolescents expressing their teenage angst by not speaking ‘properly’. Rather, it would seem to have developed along a very commonsensical pathway, attested in the evolution of different lexical items across languages.

The key to understanding how the quotative complementizer be like came to be (and to annoy so many people) lies in its multiple uses. The starting point here is like as a preposition. As prepositions do, it only precedes noun phrases or perhaps phrases with other prepositions (like a true Hollywood star); however, it is not difficult to imagine that this class could be extended to include whole sentences – et voilà, we have derived from the prepositional use in sentence 1 above the conjunction one in sentence 2 (like all good actors should). Consider now sentence 3 and like as a suffix (life-like). Combined with the preposition and conjunction like, like can now be used both before and after the thing it modifies. This sort of detachability and mobility characterizes discourse markers, and a final touch of adding be – after all, English sentences must have a verb – completes the development of the quotative be like.

This may seem all very well and logical for the solitary case of like but this alone does not make a pathway of development in any way natural. Zooming out, however, you can observe a more general change across different functional components of language. First we have the propositional component, the resources that make it possible to talk about something: the preposition like falls into this category. The conjunction like, in turn, belongs to the textual component, providing means of creating a cohesive discourse. Finally, there is the interpersonal or expressive component for expressing personal attitudes, and this is where like as a discourse marker fits in. Like is not alone in going from propositional through textual to interpersonal. Just consider the different functions of why:

  1. Why hasn’t George Clooney won more Oscars?
  2. I don’t understand why such a great actor doesn’t get all the awards.
  3. Why, that’s just bizarre.

Here the first why is clearly propositional, expressing that the sentence is a question, the second why connects the two sentences, and the last one expresses the speaker’s attitude. Why, this looks a lot like like!

Of course, elements do not change and spread their new uses on their own, no matter how tempting a pathway there is available. The origins of be like have been traced back to the US and the early 80s; from there, it had reached the UK shores by the mid-90s, not least through the media. As with so many linguistic innovations, be like is typical of younger speakers: the ratios of be like decline with age, so that it is most commonly found in the speech of under 30s and high-school students. Again, this age demographic is nothing but natural to language change – and probably the reason it attracts the dislike of language purists.

From America into dictionaries - like is here to stay. Trevor.

Like it or not, be like is a prototypical example of language change: it follows universal pathways of change, spreads through younger speakers, attracts a lot of emotion, and is here to stay. In the words of Wet Wet Wet, “Like is all around me [–]/ it’s everywhere I go, oh yes it is.”

If you would like a more in-depth tour of the wonders of like, have a read of Romaine and Lange’s study, on which the discussion here was based:

Code-switching: why I am hated by Chinese language purists

Almost two and a half years ago, I created an account on a Chinese question-and-answer website and started to answer some questions regarding linguistics and languages raised by lay Chinese netizens. It is my pleasure to introduce linguistics to the public in Chinese – after all, “modern western linguistics” itself is a relatively new concept to teenagers and young adults, and some people are curious about that. Although most of the questions I came across are rather basic and some not very scientific (for example, someone may ask me whether Chinese really has grammar, which is not a very good question from my point of view), I am more or less satisfied if people can learn more about the scientific studies of different languages, and even decide to take a course in linguistics.

However, I constantly receive some criticism from certain audiences, not about the content of my answer, but about the way in which I deliver it, or, to be more specific, about the languages I use when I talk about those topics. They told me politely or impolitely that they were “annoyed” by the use of English terminologies in my answers, and by putting English words in a Chinese article, I was “damaging the beauty, purity and integrity of Chinese”. Someone even “threatened” me that he would give up reading my answers because he cannot stand the English words in them. Before I started this blog, I did a calculation on my ten most recent answers to linguistic questions, and found that over 90% of the text was written in Chinese, though I did use a lot of English terminology when I talked about some linguistic theories.
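A calculation of this kind can be sketched in a few lines. The snippet below is only one rough way of doing it, not necessarily how the figure above was obtained: it assumes that each Chinese character and each run of Latin letters (roughly, one English word) counts as one unit, and the function name chinese_share and the sample sentence are made up for illustration.

```python
import re


def chinese_share(text: str) -> float:
    """Share of units in `text` that are Chinese characters.

    Assumption: one Chinese character = one unit, and one run of Latin
    letters (roughly an English word) = one unit. Digits, spaces and
    punctuation are ignored.
    """
    han_chars = re.findall(r"[\u4e00-\u9fff]", text)
    latin_words = re.findall(r"[A-Za-z]+", text)
    total = len(han_chars) + len(latin_words)
    return len(han_chars) / total if total else 0.0


# Made-up sample: six Chinese characters and one English term.
sample = "我觉得telicity很重要"
print(f"{chinese_share(sample):.0%}")
```

Counting by characters and words is a crude measure, and a different unit (for example, raw letter counts) would give a different percentage for the same text.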

I am going to protest my innocence here – I really do not choose to do so. I received most, if not all, of the content related to linguistics in English, and currently I see English as my major working language. The influence of English is so deep-rooted that whenever I would like to refer to a ready-made concept in linguistics, such as “psychotypology” (see my previous post for more details), “telicity”, or “aspect marker”, the first words that pop up in my mind will be in English, and I barely know their translation equivalents in Chinese. Therefore, almost every time I intend to introduce a new concept to my audience or refer to the key research methodology, I will put it in English first and try my best to give an appropriate Chinese translation, while the rest of my answer is in Chinese.

Maybe you have already realised that this phenomenon is called code-mixing or code-switching. In sociolinguistics and second language acquisition, it is a long-standing topic and has been investigated from various perspectives, including the roles of the matrix language (the language that forms the grammatical structure of a chunk of utterance) and the embedded language (the language that provides the “mixing” words in the chunk), the social identity of code-switchers and the opinions of those around them, and other possible areas. I am particularly interested in the word types, the pedagogical methods and the linguistic code appearing in code-switching. According to my personal observation of myself and friends with a similar sociolinguistic background, whether you switch the code to the filler language depends on two different factors. The first one is what type the word or phrase is, namely content words (nouns, verbs, adjectives, etc.) or grammatical words (pronouns, prepositions, conjunctions, etc.). The Differential Access Hypothesis, which is a well-discussed framework of code-switching, assumes that content words and morphemes appear more frequently in the embedded language while grammatical morphemes tend to appear in the matrix language (Myers-Scotton 2005), and a good deal of finely recorded and anecdotal data supports this hypothesis.

More importantly, the use of code-switching is somehow related to the environment and method you acquire the word by – actually, it is related to how you add the new word in the second language to your “mental all-language dictionary” – the multilingual lexicon. We assume that we store the vocabulary we know in different languages in a unified structure, and a simple illustration of a possible bilingual lexicon, which is developed by Kroll and Stewart (1994), is listed here.

The Revised Hierarchical Model by Kroll and Stewart (1994). Figure from

This proposal of the bilingual lexicon provides a vast amount of information for us to raise different hypotheses about how bilinguals memorise and use words. We can also get some inspiration from it to explain the choice of code-switching. For sequential bilinguals like me, namely people who start to acquire the second language after mastering their first language fluently, we start learning words of a second language by matching the words to the translation equivalents. For instance, when I first learnt the word “apple” (which is among my first English words), I did not establish the link directly between the English word “apple” and the round red juicy fruit, but made an interchange at the Chinese word “pingguo”, since I learnt the word via Chinese translation and I needed to rely on the Chinese word to retrieve the English word at an early stage of acquisition. That is the reason that the link from one’s second language to one’s first language is much stronger than the link between the concept and one’s second language. In that situation, since we get a stronger connection from the concept to the first language than to the second language, we will continue to use the first language words in the utterance. That is applicable to some content words and most grammatical words, which can also explain why we always use our first language as the matrix language of code-switching.

Things become different when we acquire a new word in the use of a second language. This includes some second language immersion programmes (you can see a lot in Cambridge every summer), using the second language as the working / teaching language (which is the case when I was in Hong Kong and the UK), and other environments in which you do not always use your first language. In that situation, we can directly build up the link between the form of a second language word and the concept it represents, and there is no need to refer to the first language vocabulary anymore. When we retrieve these words and examine the connections between concepts and word forms, we will find that the second language words are more readily used, and we are tempted to use them even if the rest of the utterance is organised in another language. Linguistic terminology in English is a good example for me: the imbalanced bilingual lexicon drives me to use English words when I think of the concepts, because that is the easiest way to do so.
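The reasoning in the last two paragraphs can be put into a toy sketch. This is not the Revised Hierarchical Model itself, only an illustration of the idea that a speaker reaches for whichever word form is most strongly linked to the concept. The class LexicalEntry, the function preferred_form, the link strengths, and the placeholder for the Chinese term are all made up for the example.

```python
from dataclasses import dataclass


@dataclass
class LexicalEntry:
    concept: str
    l1_form: str           # word form in the first language
    l2_form: str           # word form in the second language
    concept_to_l1: float   # hypothetical link strength between 0 and 1
    concept_to_l2: float   # hypothetical link strength between 0 and 1


def preferred_form(entry: LexicalEntry) -> str:
    """Pick the form whose link to the concept is stronger (ties go to L1)."""
    if entry.concept_to_l2 > entry.concept_to_l1:
        return entry.l2_form
    return entry.l1_form


# "apple" was learnt through its Chinese translation, so the concept is
# assumed to be tied more strongly to the first-language word.
apple = LexicalEntry("APPLE", "pingguo", "apple", 0.9, 0.4)

# "telicity" was learnt directly in English, so the concept is assumed to
# be tied more strongly to the second-language word.
telicity = LexicalEntry("TELICITY", "<Chinese equivalent>", "telicity", 0.2, 0.9)

for entry in (apple, telicity):
    print(entry.concept, "->", preferred_form(entry))
```

With these invented numbers the everyday word comes out in the first language and the technical term in English, which is the pattern described above.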

Code-switching is definitely more complicated than word switching, but I just want to present an alternative viewpoint to the whole picture. Maybe the language purists simply cannot understand the use of two different languages at the same time, and they can only attribute such things to showing off, but I believe that the motive is more cognitive. Previous investigation of code-switching indicates that code-switching always comes with additional cost: code-switchers need more time to prepare for articulation (or typing, of course), and they may need to activate the words in the other languages and adjust the structure of sentences as well (see Meuter and Allport 1999; Meuter 2005). Nevertheless, my friends and I do code-switching all the time, and we do it for communicative purposes: if the English word can better deliver the intended message, why not use it in an utterance? After all, as intelligent animals, humans will not do anything without considering its convenience, and language use is no exception.


Looking for lost voices

In modern language sociolinguistics we are often interested in investigating the speech of specific social groups. We might compare the speech of people from different ethnic groups, or different socio-economic classes or genders. Alternatively, we might investigate differences in language use in different contexts. How do people use language differently in formal contexts like job interviews as compared with informal contexts, like chatting with friends in the pub?

In either case, the first step is to collect data: to record language use by the different groups of people we’re interested in, or in the different contexts we’re interested in. But what can we do when that’s impossible? When we’re investigating historical languages, we’re limited to whatever language happens to have been written down and whichever bits of writing happen to have survived until the present. That’s normally quite a skewed sample in lots of ways. In many historical periods only certain social groups (typically wealthy, powerful men, often particularly those associated with the church) learned to read and write, and so only those social groups leave a written record of their language. Furthermore, language was only written down in certain contexts: records of laws and legal proceedings, religious writings, financial transactions and perhaps narrative literature and poetry. Unlike with today’s social media, casual everyday interactions did not take place in writing. So how can we investigate the language of other social groups, or language use in informal contexts?

One possible answer is by investigating reported speech in fiction. Unlike scribes and the authors of texts, characters in fiction may come from a wide range of social groups and fiction may describe everyday interactions, providing us with data to investigate.

Obviously we can’t assume that the language used by characters in fiction was identical to the language of similar real people in society at the time—authors will undoubtedly have been best at representing their own language and the language of the social groups with whom they normally interacted. However, represented speech in fiction is often used as an expressive tool to represent the very social phenomena that we’re interested in, which encouragingly suggests that we should find interesting variation to research (Kiełkiewicz-Janowiak 1999:59; Culpeper 2009:81, 307). Better still, parallel research on language in modern fiction does suggest that language use by characters from particular social groups can reflect the language used by those social groups in reality. Work on language use by male and female characters in Japanese sitcoms has found that language use by female characters has many of the same features which typify spontaneous speech by female speakers. Features which people are quite well aware of and make use of for stereotyping can be even more pronounced in the fictional speech than in real speech (Shibamoto 1987:48; Shibamoto Smith 2004:126). Similar findings have been reported for male speech (Occhi, SturtzSreetharan & Shibamoto Smith 2010) and for Japanese novels rather than sitcoms (Shibamoto Smith 2004).

So if this works for modern languages, we should also be able to do it for historical languages, right? And some researchers have done just this. Research on Latin texts seems to show that male and female characters make slightly different choices of words (Adams 1984). Work on Classical Greek drama has shown differences between the speech of male and female characters in terms of choices of words (Bain 1984; Sommerstein 1995), choice of conversation topics, rhetorical structures (Mossman 2001), and choices of pronouns (Meluzzi 2010). Willi’s work on differences in grammar and choice of words in the speech of female and male characters in Classical Greek comedy goes further still, showing that what differences there are were understood in similar ways to gendered differences in modern languages: female characters used more politeness features and more innovative features (Willi 2003:176–195), and speech had more such characteristics in single-gender groups than in mixed groups (Meluzzi 2010:96–98; Willi 2003:196).

This stuff is really exciting. Classical Greece and Rome were incredibly sexist societies: very few women learned to read and write and vanishingly little written material by women survives. So, language use by fictional characters may be our only possible window on the language of Greek and Roman women in this period.

In my own work, I’ve tried to go one step further. The studies cited above all looked at language in fiction in just one time period. Studying language in Old Icelandic fiction, I’ve taken represented speech from texts spanning almost three centuries to try and find out whether the way that female and male characters were involved in changing language over time is similar to the way we know that people of different genders are involved in language change in modern societies. As I mentioned in an older blog post, a common pattern in modern societies is that women are found to lead in language change, using more of a newer form earlier than men. And the results of my study do seem to show a similar pattern for one change which was taking place in Old Icelandic. As an older form is replaced by a newer one over several centuries, the represented language of female characters seems to stay about 15-20% ahead of the language of male characters.

Unlike with modern language studies, we’ve no way of then going and confirming that this language use in fiction really did reflect the use by real people. Nevertheless, it’s exciting to get a hint of patterns like this which would otherwise be lost! If you’re interested to read more about my study, you can find the paper on my page.


Now you hear it …

In this post I will discuss one of the more important (and, in some quarters, more controversial) ideas in modern theories of grammar. This is that there are some elements (“words”, if you like) which are present syntactically but phonologically have no form – which means they are there but we can’t see or hear them.

There are all sorts of examples of this. Take, for one, the following sentence:

Which kitten did Lucy buy?

Now, which kitten here is semantically the object of buy – compare the sentence Lucy did buy a kitten. That second sentence is an example of the fact that in English objects usually come after verbs. But this doesn’t seem to be the case in Which kitten did Lucy buy? For this reason and others linguists have suggested that which kitten starts off after the verb and is “copied” to the front of the sentence: but only the first copy you reach is pronounced. So the sentence really looks something like this:

Which kitten did Lucy buy which kitten?

Another example is not from English. Lots of languages allow pronouns like I and they to be left unspoken in some contexts. Spanish is a good example:

Vivo en Cambridge.
live-I in Cambridge

There is no pronoun here (the same meaning is conveyed through the suffix -o on the verb). But there are good reasons for believing all sentences need to have a subject, even in a language like Spanish. So it’s suggested that there is a pronoun there, in the normal place, it just has a “zero phonetic realisation” – it doesn’t contain any sounds that need to be pronounced, so you can’t hear it.

There are examples a bit like this in English too. Here’s another sentence:

It is important to feed yourself.

yourself is a sort of word called a “reflexive pronoun”, which basically means it needs to refer back to something earlier in the sentence. Hence in a sentence like You like yourself, yourself refers back to you. But in It is important to feed yourself there’s nothing for yourself to refer back to. Therefore, we can postulate a silent pronoun, which is also useful as it gives us a subject for to feed, a bit like the following:

It is important you to feed yourself.

(Note that, in this instance, the sentence would be ungrammatical if the pronoun was pronounced.)

As a final, relatively easy example, compare the following two sentences:

Harry said that he would freeze the fish-fingers.
Harry said he would freeze the fish-fingers.

Spot the difference? The two sentences are identical except that one has that in and one doesn’t. One way of looking at this is to claim that the second sentence does contain an element equivalent to that, it just happens to be silent.

So we can conclude from this and many more examples that not everything in a sentence is necessarily pronounced, and that we can learn a lot about language by looking beyond what we hear to things that we don’t.

Game Theory

Tis the season to be merrily playing board games! Recently we were given a rather good new one by some friends, called Hanabi, and all our guests and family members have been subjected to it. It’s definitely a game for a pragmatician like me.

My interest was first whetted when I was told, upon presentation, that it is a co-operative game. Now, regular readers of this blog might remember previous posts on Pragmatics, introducing a chap called Paul Grice, a British philosopher and linguist whose thinking is foundational for much present-day pragmatics. His big thing was The Co-operative Principle: “Make your contribution [to the conversation or exchange] such as is required, at the stage at which it occurs, by the accepted purpose or direction of the talk exchange in which you are engaged.” Now, you might think that such a statement itself is a bit obtuse and not very, well, co-operative. But it’s basically saying, say the right thing at the right time in the right way.

We do this all the time when we communicate. When I woke up the other morning and exclaimed to my husband ‘the bin!’, he had no trouble inferring that I meant something like, “help! it’s a Thursday – we must get the black wheely bin out at once to avoid having smelly rubbish on our hands for the next fortnight!” But, in another context, that might not have worked, or it might have meant something entirely different. Most of the time we know how much information we need to convey to our conversation partner.

But what happens when there are some extra constraints on our communication? That’s where the fun of Hanabi starts. All the players have to work together against the game, to construct a wonderful fireworks display, being “absent-minded firework manufacturers who accidentally mixed up powders, fuses and rockets, [with] the show about to start and panic setting in” (though feel free to ignore the back story). Everyone holds their cards facing outwards, so I can see everyone else’s cards, containing elements of the fireworks display, but not mine. Players have to communicate pieces of information to other players, so that they know which of their own cards to play or discard. ‘Easy!’, you might be thinking. But just wait a second. In any one turn you can only communicate one piece of information about the number or the colour of the cards in one other person’s hand. So for example, if I’m looking at your cards (below), I might say “you have two fours”, or, alternatively, “you have one red”, and point those cards out.


So, as the speaker, I have to decide not only what the most useful thing for another player to know is, but also how I can communicate that to them. I have to take into account what we both know about the game so far, what the most salient aspects of the game currently are, what information other people have already communicated and how. In other words, I have to ‘do pragmatics’, obeying the Co-operative Principle – what we all do all the time when we’re chatting, or writing a blog post. But the difference here, playing Hanabi, is that, like the cards, it’s all on the table. The reasoning I’m doing about what I want to implicate and what inferences my chosen other player will make is conscious (and sometimes somewhat tortuous). Usually making a co-operative contribution to some conversation requires complex but pretty subconscious reasoning; in Hanabi, the interesting twist is that both speaker and hearer are paying conscious attention to it. And I wonder what difference that makes?

From my own anecdotal experience, it can be extremely difficult, as the hearer – the player given information to work out what someone else intends me to infer – precisely because I’m giving equal consideration to the numerous options: ‘now, has he told me that’s a red, because he wants me to play it now, or play it later, or it’s no longer useful and can be discarded, or…?’ That’s the kind of quandary we usually only find ourselves in in situations of miscommunication, when it’s really unclear what someone meant. (Another domestic example: Me – have you washed the pots? Husband – yes. Me – but they’re still muddy. Him – oh, I thought you meant pans, not potatoes!).

Just in case you’re wondering whether this is a nice ol’ ramble, but a bit far removed from any serious linguistic content: a couple of pragmaticians1 did actually conduct a study not a million miles removed from Hanabi, in which participants had to communicate which object an interlocutor should pick up using only colour or shape, to find out whether speakers can refer to objects optimally, and hearers can interpret as the speaker intended. (Answer: yes, but it’s complicated…)

How would you pick out each of these objects in the row, with only a shape word or a colour word?
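One way to think about this question is as a chain of reasoning between speaker and hearer. The snippet below is a minimal sketch in the style of the Rational Speech Acts family of Bayesian models. It is not the set of models compared in the study mentioned above, and the three objects are invented for the example. It assumes the hearer has no prior preference for any object and the speaker simply prefers words that are more likely to be understood literally.

```python
from fractions import Fraction

# Hypothetical scene: each object is a (colour, shape) pair.
objects = [("blue", "square"), ("blue", "circle"), ("green", "square")]
words = ["blue", "green", "square", "circle"]


def normalise(scores):
    """Scale the scores so that they sum to one."""
    total = sum(scores.values())
    return {key: value / total for key, value in scores.items()}


def literal_listener(word):
    """Spread belief evenly over every object the word is true of."""
    return normalise({obj: Fraction(1) for obj in objects if word in obj})


def speaker(obj):
    """Prefer words that make a literal listener more likely to pick obj."""
    return normalise(
        {word: literal_listener(word)[obj] for word in words if word in obj}
    )


def pragmatic_listener(word):
    """Reason about which object would have made the speaker say this word."""
    return normalise({obj: speaker(obj)[word] for obj in objects if word in obj})


for obj, belief in pragmatic_listener("blue").items():
    print(obj, belief)
```

In this scene the word blue is literally true of two objects, but the reasoning hearer leans towards the blue square (3/5 against 2/5), because a speaker who meant the blue circle had the unambiguous word circle available. That is much like the guesswork a Hanabi player does when wondering why a particular hint was chosen.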


My experience of Hanabi, though, makes me wonder how much people’s communicative behaviour changes when they’re placed in such a peculiar game situation, with conscious attention paid to communicating co-operatively; how much does it tell us about everyday linguistic reasoning?

But for now, it’s back to some more playing at pragmatics.
1 Ciyang Qing and Michael Franke (2015). Variations on a Bayesian Theme: Comparing Bayesian Models of Referential Reasoning. In: Bayesian Natural Language Semantics and Pragmatics, Ed. by Hans-Christian Schmitz and Henk Zeevat, Heidelberg, Springer

Meri Kurisumasu!

It is that time of the year again: an overdose of Wham! on the radio, the annual parade of cheesy jumpers, an increased interest in “working from home” (somehow clustering around days after office Christmas parties), and, for anyone taking a beginners’ language class, the time to learn seasonal greetings in the target language. For my Japanese class, this turned out to be a bit of a no-brainer: Merry Christmas is rendered into Japanese as メリー クリスマス, or Meri Kurisumasu. With the linguistic aspect of our final class memorized in seconds, everyone could happily focus on creating appropriately kawaii origami Christmas cards.

Uh oh, it's that time of the year again.


Christmas has been celebrated in Japan only for the past two decades or so, being largely a cultural import from the USA (with uniquely Japanese features – Japanese Christmas is definitely not a religious festival and has more of a Valentine’s day touch to it). As such, as with so many borrowed things, it is not surprising that the word for Christmas, along with the whole greeting, is also borrowed. Loanwords such as these are used directly from another language with little or no translation, and are a handy thing to have around when in search of le mot juste; how would you express the concept of Schadenfreude without a little help from German (or, for that matter, any of the concepts on this list of words English has borrowed)?


How many loanwords are adopted into a language and what happens to them depends on many non-linguistic factors. The Académie française, the French council for matters pertaining to the French language, is notorious for its abhorrence of the likes of le weekend, while the Finnish equivalent has a history of failed attempts at introducing Finnish alternatives to incoming loanwords: sohva ‘sofa’, for instance, was rendered joukkoistuin, literally translated as ‘group seater’.

What's all this group seater business about? Laura Bittner.

Whatever the perceptions of the acceptability of loanwords, borrowings often undergo processes of change to fit in more comfortably with the phonology of the borrowing language. Hence loanwords, even if superficially very similar to the original, are rarely pronounced in exactly the same way as in their language of origin: don’t expect your English version of French bric-a-brac or German Doppelgänger to make you sound like a native of either language. Interestingly for the linguist (and all the readers of CamLangSci!), the sorts of processes loanwords undergo can inform us about the phonological structure of the borrowing language.


What about Meri Kurisumasu? It turns out that this easy-to-learn Japanese-for-dummies greeting from my Japanese class is a linguistic gold mine of Japanese sound structure:


  • Japanese does not allow complex onsets in its syllables, meaning that consonants may not generally occur immediately next to each other but must be separated by a vowel. Hence, the first sound in Christmas, [kɹ] ([ɹ] is the phonetic description for the sound spelled as r in English), becomes [ku] in Kurisumasu, and likewise there is an additional [u] in the middle of the word, Kurisumasu, to break up the [sm] sequence in Christmas.


  • The letter r in Meri Kurisumasu could be replaced by l and not make any difference for a speaker of Japanese; in fact, the two sounds are not differentiated in the Japanese writing systems (リ, for example, can be used to render either [li] or [ɹi] in English words). This follows from the fact that [ɹ] and [l] are allophones, and not phonemes, in Japanese. A phoneme can create a difference in the meaning of a word, while allophones are different realizations of the same phoneme. In English, for instance, right and light differ only in their first sound, their difference in meaning showing that [ɹ] and [l] in English are separate phonemes. In Japanese, they would be perceived as the same sound and there would be no difference in meaning.


  • Although Kurisumasu ends in a u in its spelling, this sound is not in fact pronounced in Tokyo Japanese. u is what is known technically as a devoiced vowel, or, more common-sensically, as silence. So why does it appear in the spelling, I hear you ask. For one thing, while you can’t hear the u in normal speech, it is pronounced in highly emphatic speech. And in Japanese poetry, this silence counts as a unit of structure just as a vowel you can hear would. (Yes, one of the weird and wonderful aspects of linguistics is positing things you can’t always hear or see!)
Silent vowels are real... sounds a lot like what you tell children about Santa! Andrew Roberts.
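
For the programmers among you, the first two points above can be tried out in a few lines of Python. This is only a minimal sketch under loud assumptions: the function name adapt_loanword is made up, the input is a rough, simplified transcription rather than proper IPA, and the only repair strategy is inserting u. Real Japanese loanword adaptation is more involved (other vowels are inserted after some consonants, and a nasal can close a syllable), so treat it as a toy.

```python
VOWELS = set("aiueo")


def adapt_loanword(segments: str) -> str:
    """Toy adaptation of a roughly transcribed word to Japanese-style syllables.

    Assumptions (not from any real library): one character per sound,
    'u' is always the inserted vowel, and l and r are treated as the same sound.
    """
    # l and r are not distinguished, so write both as r.
    segments = segments.replace("l", "r")
    adapted = []
    for i, seg in enumerate(segments):
        adapted.append(seg)
        next_seg = segments[i + 1] if i + 1 < len(segments) else ""
        # A consonant that is not followed by a vowel gets a 'u' after it.
        # This breaks up clusters like kr and sm and repairs a final consonant.
        if seg not in VOWELS and next_seg not in VOWELS:
            adapted.append("u")
    return "".join(adapted)


print(adapt_loanword("krismas"))  # kurisumasu
print(adapt_loanword("meri"))     # meri
```

Note that the sketch only models the spelling: it says nothing about the final u being devoiced in actual speech.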

Given these basics of the Japanese sound system, why not convert some of that excess mince pie energy into brain power and figure out what abunoomaru, ueetoresu, salaliiman (all from English), and alubaito (from German) mean in Japanese? (And if that has left you under the impression that English is more of a lender than a borrower, this serves to show that it is in no way a one-way relationship between Japanese and English.)


With that, I wish a Meri Kurisumasu (whatever form this may take for you) from all of CamLangSci to all of you!

You was

Amelia by Henry Fielding - title page

Earlier this year I was reading Henry Fielding’s 1751 novel Amelia. A matter of linguistic interest that struck me was the frequent use of the phrase you was, nowadays stigmatised as pretty firmly non-standard English, and certainly not something you particularly expect to find used by posh characters in a classic novel. But there it was, repeatedly, for example:

  • “Indeed, Will, you was a charming fellow in those days.”
  • “I ask your pardon, madam,” said the doctor; “I forgot you was a scholar.”
  • “If you was a reasonable woman,” cries James, “perhaps I should not desire it.”

You might wonder if you was was just something everybody used in the past, and the modern you were is a more recent innovation. But in fact you were is definitely the older term. The paradigm of to be in the past tense in English in about the sixteenth century was something like the following:

I was            we were
thou wast        ye were, you were
he/she/it was    they were

were is used with the plural forms and was(t) with the singulars. But also about this time a change was underway in the second person which saw the old singular thou replaced with the plural you, giving us the modern paradigm as follows:

I was            we were
you were         you were
he/she/it was    they were

So what’s going on in Amelia? Here, we see almost exclusive use of you was: 33 instances, against only 1 of you were. (All of these are singular or ambiguous as to number; that is to say, there are no clear instances of plural you with a past tense of be.)

This looks like a case of a historical process called analogy. By this point, you has taken over as pretty much the sole second-person pronoun, replacing thou in the singular. (There are no instances of thou wast, and across the whole text the older forms thou, thee and ye are very much less frequent than you, which makes up about 98% of uses of second-person pronouns.) But this creates a disuniformity in the paradigm, one which is still present in standard English today: were is no longer an exclusively plural form. Some speakers in Fielding’s time clearly decided to get around this by extending was to (singular) you:

I was            we were
you was          you were
he/she/it was    they were

This paradigm makes was the sole form in the singular, and reinstates were as the only form in the plural. (Of course we can’t tell on the basis of the Amelia data alone if you were was retained in the plural, as we don’t definitively have any relevant uses of plural you, but there is evidence from other texts from the same time that this was the case.) 

But what’s also interesting is that this change didn’t persist. At some point the trend toward you was was reversed, and standard English went back to you were. This illustrates that changes in language don’t proceed inexorably toward some end goal: they can be, and sometimes are, halted midway. It has been suggested that, in this particular case, the change may have been reversed through the influence of an important 1769 grammar by Robert Lowth, which condemned the use of you was. In general, perhaps, prescriptive attitudes don’t have that much of an influence in terms of preventing changes in the long term, but maybe in this rare instance we should give prescriptivists reason to take heart: perhaps their efforts aren’t utterly futile after all!


I, you, and she in case and agreement

This post is about person in grammar, a notion that we split up into (at least) three categories called “first”, “second” and “third” person. In English, there are personal pronouns corresponding to each of these categories, like I, you, or she.

There are many things to be said about pronouns and person, but I’ll focus on one that I find particularly interesting and that figures prominently in my dissertation: the degree to which the person of a subject and an object can influence the form of a predicate or the form of the subject or the object itself in different languages (see also Jim Baker’s excellent related posts here and here).

While English has person and different pronouns, its verbal morphology is not very interesting, so I will start with Hungarian. We’ll see some slightly complicated verb forms first, and when you’re all confused, I’ll tell you about a beautifully simple way in which person influences verb forms in Hungarian.

Consider the examples in (1). The verb form with a first person subject in (1a) differs from the form with a third person subject in (1b) (in both Hungarian and English). This is called “subject agreement”, since the verb “agrees” with the subject. (A word about the examples: the first line shows the example in the language we’re talking about, the second line provides some grammatical information and English translations. “1SG” means “first person singular”, for example. The third line provides a full translation.)

(1) Hungarian
    a. Én lát-ok.
       I  see-1SG
       ‘I see.’

    b. Ő lát.
       s/he see.3SG
       ‘S/he sees.’

But Hungarian verbs do not only indicate the person of the subject, they can also indicate the person of the object. We can see this if we add a definite object, a third person pronoun in this case, to the sentences above. Now “1SG>3” means “first person singular subject and third person object”. Cool, right? (Note also the case ending on the object: “ACC” for “accusative”.)

(2) Hungarian
    a. Én lát-om    ő-t.
       I  see-1SG>3 s/he-ACC
       ‘I see him/her.’

    b. Ő    lát-ja    ő-t.
       s/he see-3SG>3 s/he-ACC
       ‘S/he sees him/her.’

If we look at the English verbs, we see that their forms differ based on whether the subject is first or third person, but it doesn’t make a difference whether they have an object (say a pronoun like her) or not. In other words, for each (subject) person, there is only one form in English per tense. In Hungarian, there are several forms: a verb can agree with the subject only, as in (1), or with the subject and the object, as in (2). To make things even more fun for learners and linguists alike, this only happens with some objects, though.

In another interesting spin, it depends on the person of both the subject and the object whether both are indicated on the verb. If the person of the subject is first person and the object is third person, like above, the verb seems to indicate both (in other words, the verb shows subject and object agreement; I’ll indicate this as “1>3”).

What happens in other persons? When the subject is third person and the object is first person (3>1), does the verb also show subject and object agreement? It does not!

(3) Hungarian
    Ő    lát     engem.
    s/he see.3SG me
    ‘S/he sees me.’

If we look at the verb form in the last sentence and the one in (1b), they are the same: for a verb with a third person subject, it does not matter whether the verb has an object or not, it shows the same form lát meaning ‘s/he sees’.

Confusing, right? One more thing on Hungarian, though: the best way to show the sensitivity to the person of both the subject and the object is with second person objects:

(4) Hungarian
    a. Én lát-lak   téged.
       I  see-1SG>2 you
       ‘I see you.’
    b. Ő    lát     téged.
       s/he see.3SG you
       ‘S/he sees you.’

In (4b), the verb only shows the person of the subject (as in (3)), but in (4a), the verb shows the person of the subject and of the object: -lak only appears with first person subjects and second person objects. The two sentences in (4) have the same object, but whether the verb shows object agreement depends on the person of the subject!

OK, so what’s the beautiful pattern behind all this? Consider this so-called “hierarchy”:

(5) 1 > 2 > 3

To decide whether a Hungarian verb shows agreement with both the subject and the object, we have to look at whether the person of the subject is higher than the person of the object. If this is the case, say with a first person subject and a second person object, or 1>2, we see agreement. This kind of configuration is called “direct”.

But if the object’s person is higher, say a third person subject and a first person object, or 3>1, there is no agreement in Hungarian. Such configurations are called “inverse”. (This is not quite the whole story for Hungarian, but it’s the general pattern. There are some references below if you’re really interested).
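
If it helps to see the rule spelled out mechanically, here is a minimal sketch in Python of the comparison described above. The function name configuration is made up for illustration, and the sketch covers only the simplified pattern: when subject and object have the same person (as in 3>3 in (2b)), the hierarchy in (5) on its own does not decide anything, so the function says so rather than guessing.

```python
def configuration(subject: int, obj: int) -> str:
    """Classify a subject/object pair using the hierarchy 1 > 2 > 3.

    Persons are given as 1, 2 or 3. A smaller number is higher on the hierarchy.
    """
    if subject < obj:
        return "direct"    # subject outranks object: agreement with both
    if subject > obj:
        return "inverse"   # object outranks subject: subject agreement only
    return "undecided"     # same person: not covered by the simplified pattern


# The Hungarian configurations discussed in the post:
for subject, obj in [(1, 3), (3, 1), (1, 2), (3, 2)]:
    print(f"{subject}>{obj}: {configuration(subject, obj)}")
```

Running it labels 1>3 and 1>2 as direct and 3>1 and 3>2 as inverse, which matches the verb forms in (2a), (4a), (3) and (4b) respectively.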

So far, this was about agreement, i.e. the form of the verb. However, the same hierarchy also influences the form of the subject or the object in some languages. In Kashmiri, for example, the case of the direct object depends on the person of both the subject and the object. If the object’s person is higher than the subject’s person, the object appears in object case (like him or her in English, as opposed to he and she). Compare the following examples:

(6) Kashmiri
    a. bı     chusath tsı      parınaːvaːn
       I.SUBJ am      you.SUBJ teaching
       ‘I am teaching you.’

    b. tsı      chukh me    parınaːvaːn
       you.SUBJ are   I.OBJ teaching
       ‘You are teaching me.’

In the first one, the subject’s person is higher than the object’s: 1>2, a direct configuration. Therefore the object is in its subject form, i.e. the same form tsı that appears as the subject in example (6b). In that example, the person of the subject (2) is lower than the person of the object (1), and therefore the object has object case, i.e. me meaning, well, ‘me’ (as opposed to bı in (6a), meaning ‘I’).

To show that the same thing holds for second person, let’s see the following examples. First, the subject’s person (2) is higher than the object’s (3), and therefore the object has subject case (compare su in both sentences!). In the second example, however, the object’s person is higher and therefore the object shows up in object case (tse rather than tsı).

(7) Kashmiri
    a. tsı      chihan su      parınaːvaːn
       you.SUBJ are    he.SUBJ teaching
       ‘You are teaching him.’ (literally something like ‘you are teaching he’)

    b. su      chuy tse     parınaːvaːn
       he.SUBJ is   you.OBJ teaching
       ‘He is teaching you.’

Again, the hierarchy in (5), 1 > 2 > 3, gives us a way to describe what’s going on: only if the object’s person is higher than the subject’s does the object show case-marking. In other words, the object shows case-marking in inverse configurations.
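
The Kashmiri pattern can be sketched the same way. The pronoun forms below are just the ones that appear in (6) and (7); the function name object_form is hypothetical, and the sketch deliberately refuses to say anything about subject and object of the same person, since the examples here don’t cover that case.

```python
# Pronoun forms taken from examples (6) and (7) only.
SUBJECT_FORM = {1: "bı", 2: "tsı", 3: "su"}
OBJECT_FORM = {1: "me", 2: "tse"}  # a third person object can never outrank the subject


def object_form(subject: int, obj: int) -> str:
    """Pick the form of the object pronoun, given the persons of subject and object."""
    if subject == obj:
        raise ValueError("same-person configurations are not covered by this sketch")
    if obj < subject:
        # Inverse: the object is higher on 1 > 2 > 3, so it is case-marked.
        return OBJECT_FORM[obj]
    # Direct: the object keeps its subject form.
    return SUBJECT_FORM[obj]


print(object_form(1, 2))  # tsı, as in (6a)
print(object_form(2, 1))  # me,  as in (6b)
print(object_form(2, 3))  # su,  as in (7a)
print(object_form(3, 2))  # tse, as in (7b)
```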

There are many other examples of similar patterns across the world: some Native American languages have both case-marking (like Kashmiri) and verb forms (a bit like Hungarian, but more complex) that differ depending on the person of the subject and the object.

To give a final example, the language Awtuw, spoken in Papua New Guinea, requires that some objects appear in object case, but this does not just depend on person, but also on whether the object is more “animate” than the subject. And you need to know that humans count as more animate than animals, in this language.

(8) Awtuw
    a. Tey tale-re   yaw dæli
       the woman-OBJ pig bit
       ‘The pig bit the woman.’

    b. Tey tale  yaw dæli
       the woman pig bit
       ‘The woman bit the pig.’

According to Feldman’s grammar of Awtuw, the more animate, human argument (the woman) can only be the object if it is specially marked by the suffix -re. Rather than looking at person directly, for Awtuw we seem to have a hierarchy like the following:

(9) human > animate

Is there a way to combine humanness or animacy and person? Many linguists think so! They suggest that hierarchies are quite large, like the one in (10), and that they incorporate both person and humanness.

(10) 1 > 2 > 3 > human > animate > inanimate

Languages differ in how they lump several levels together: in Hungarian, humanness or animacy do not play a role in determining the form of the verb, for example. In Awtuw, on the other hand, they do in determining the form of the object. And obviously, many languages do not show these effects at all.

Languages obviously differ in whether such hierarchies influence agreement or case morphology, both, or neither, but there are nevertheless some very interesting generalisations that seem to hold across languages. “Special” marking like object case in Kashmiri or Awtuw tends to appear when the object’s person (or animacy) is higher than the subject’s but not when the subject’s person or animacy is higher than the object’s. It seems that direct configurations are “the norm”, while inverse configurations are “special”.

Why should this happen so regularly?

Some linguists suggest that the most typical kinds of subjects in transitive clauses tend to be high on the hierarchy, while objects tend to be low. Constructions that diverge from this norm are therefore expressed in a special way.

Another way to describe hierarchies is to assume that “1” and “2” represent more complex notions: 1 stands for the features “speaker, participant, person”, whereas 2 stands for “participant, person”, and 3 merely for “person”. This way of defining “person” makes first person the most specific and third person the least specific. First person always includes the speaker, but the reference of third person is much, much less restricted, and this might be a way of capturing this specificity in reference — and the fact that first and second person tend to behave in more “special” ways than third person.
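
The feature-based view is easy to play with in code. In this minimal sketch (the names FEATURES and is_more_specific are made up), each person is a set of the features just listed, and “more specific” simply means “has all of the other person’s features plus at least one more”. Sorting the persons by how many features they carry gives back the hierarchy in (5).

```python
# Feature sets as described in the text: first person is the most specific.
FEATURES = {
    1: frozenset({"speaker", "participant", "person"}),
    2: frozenset({"participant", "person"}),
    3: frozenset({"person"}),
}


def is_more_specific(a: int, b: int) -> bool:
    """True if person a carries every feature of person b and at least one more."""
    return FEATURES[a] > FEATURES[b]  # proper superset


# Ordering persons from most to least specific recovers 1 > 2 > 3.
hierarchy = sorted(FEATURES, key=lambda person: len(FEATURES[person]), reverse=True)
print(hierarchy)               # [1, 2, 3]
print(is_more_specific(1, 3))  # True
print(is_more_specific(3, 2))  # False
```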

To sum up, person, as inconspicuous as it is in English grammar, does fascinating things in languages all over the world, leading to case-marking here and agreement there — or in fact making certain sentences impossible. Jelinek and Carnie (2003) report of the Native American language Lummi that it is not possible to say “he advised us” (with a first person object):

“Speakers produce the example sentences comfortably until they are asked to say ‘He advised us’. Then they stop, look surprised and uneasy, and then if they are good consultants, after a while may say something like ‘Well, we don’t say it that way. You might say ‘We were advised’, but it’s not really the same, is it?’”


A Sign Corpus For All

On the 13th of November UCL’s Deafness, Cognition and Language Research Centre (DCAL) celebrated its 10th anniversary. In ten years DCAL has had a profound effect on a number of areas, from Clinical Psychology to Education. One of the most exciting projects from a linguist’s perspective is probably their British Sign Language (BSL) Corpus Project. Before 2008 there was no large accessible collection of BSL signing. DCAL decided to address this gap and set out to collect signing data from Deaf participants from different areas of the UK. Ultimately signing data was collected from 249 Deaf people in 8 cities (London, Bristol, Birmingham, Manchester, Newcastle, Glasgow, Cardiff and Belfast). Within these signers there were also different genders, ages, ethnic groups and occupations represented. Participants were interviewed, held conversations with other signers and were asked to provide their preferred sign for 102 different concepts (e.g. ‘America’ or ‘dog’). This gave DCAL a wealth of signing data unlike anything ever collected on BSL before.



So, why is this important?

  1. The project makes this data accessible to the general public. This means that signers, learners of BSL and linguists (including you!) can all look at videos of signing for any purpose.
  2. The corpus acts as a BSL time capsule. DCAL has shown that language change is happening very quickly in BSL and by having the BSL Corpus it is possible to keep a record of what BSL looks like now.
  3. Linguists can study the corpus to get a better understanding of the structure of BSL. This, in turn, influences the teaching of BSL and the training of interpreters.
  4. The corpus records the regional variety of BSL as well as the differences across age groups and genders. This is of particular interest to sociologists and sociolinguists. How would we have known before the corpus that there were at least 17 variations of the sign PURPLE?
  5. Other countries have been spurred on to create sign language corpora and this may allow future comparison between different sign languages.
  6. In the future, DCAL will make the corpus completely searchable like the corpora of written or spoken language. Once it is machine readable it will be open to further research by computational linguists and may be more easily compared with corpora from spoken languages.
  7. The BSL Corpus Project has been used to produce a free online dictionary of BSL based on the signs provided by corpus participants. This is an invaluable tool for learners of BSL and contains over 2,500 signs from the different regions of the UK.

If you are interested in finding out more about the BSL Corpus, visit their website. You can also hear Dr Adam Schembri talking about the project 5 years ago on UCL’s Mini-lecture series here.

Twitter dialectology

Traditionally, we’ve found out about variation in how people speak—whether that be variation between people in different places, of different classes, genders, or whatever—by doing surveys. Dialectologists have travelled around the country interviewing a few people in each town to record how each would say a set of words. Sociolinguists have interviewed wide ranges of people from different educational and social backgrounds and looked for differences in how they speak. These sorts of methods have been very successful—but they’re also very costly. Sending out researchers to do dialectological surveys is an expensive business: many researchers are needed to carry out the long process of getting to know local people and finding some who are willing to be interviewed in every locality and all those researchers have to be paid for their time and travel. The reality is, there just hasn’t been the funding in humanities and social science research to do this sort of work on a large scale for some years and so much of our data is rather out of date.

But in the era of the internet and ‘Big Data’ there’s a new way of finding out about language variation: using social media. And so a new generation of research into language variation using language data from social media is just starting to appear.

Using social media data for research is a very different proposition to traditional survey data. Obviously, it’s mostly written rather than spoken data, which immediately puts some limits on the sorts of things it can tell us. More problematically, you can rarely find out as much information about each person in your study as in a traditional survey, and even what information you can find out is unreliable. As an interviewer in person the researcher can ask for more information when needed: ‘You say you’re from York—were you born and brought up there, or did you move around as a child? Were your parents also from York?’ But dealing with online data, the vast majority of the time what you see is all you get. You know what the user chose to write in the ‘Hometown’ box but not necessarily what they meant by it. You know where their phone was when they tweeted—but you don’t know if that’s the place that they live and were brought up, or indeed whether those are the same places.

Nevertheless, there is one big advantage to this sort of data: there’s lots of it. And a big enough quantity of data can often make up for low quality data, if we’re asking the right questions. Because of the uncertainties about who’s really behind the keyboard, we can rarely use social media to make definitive statements about how much a given group of people speaks or writes in a certain way (that would be statements like ‘people under 25 from London use the word order “give it me” 50% of the time and “give me it” 50% of the time’)—but we can make comparative statements (like ‘people from London use the word order “give it me” twice as often as people from Lincolnshire’).
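
To make the difference between the two kinds of statement concrete, here is a minimal sketch in Python. All the counts are invented purely for illustration, and the function name rate_ratio is hypothetical. The point is that the ratio between two places can be meaningful even when we don’t trust either rate on its own as a description of the people who live there.

```python
def rate_ratio(a_variant: int, a_total: int, b_variant: int, b_total: int) -> float:
    """How many times more often place A uses a variant than place B.

    Each place contributes a count of the variant and a count of all
    relevant tokens (here: every "give it me" or "give me it").
    """
    # (a_variant / a_total) / (b_variant / b_total), rearranged to a single division.
    return (a_variant * b_total) / (b_variant * a_total)


# Invented counts: tweets containing "give it me" out of all tweets
# containing either "give it me" or "give me it".
london = (200, 1000)
lincolnshire = (100, 1000)

print(rate_ratio(*london, *lincolnshire))  # 2.0, i.e. "twice as often"
```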

To exemplify what sort of work is being done with social media at the moment, I’ll take you through a couple of interesting recent papers (links to both are found at the bottom of the post). Gonçalves & Sánchez (2014) gathered around 50,000,000 tweets written in Spanish and associated with a GPS location over two years. They then tracked lexical variation—variation in the words people choose to use to describe a given concept—to see if they could find differences in people’s language use associated with different places. The map below is reproduced from their paper, showing the different words used for ‘car’. As you can see, five distinct areas emerge: people in North America and northern South America largely use ‘carro’; people in Central America and in Spain usually use ‘coche’; and people in the southern half of South America generally use ‘auto’.


Map of the words used for ‘car’, reproduced from Gonçalves & Sánchez (2014).

They then took results like this for many words and used machine learning algorithms (specifically K-means clustering) to investigate whether there were identifiable groups of dialects. The result was very surprising. Instead of showing big, regional dialects associated with contiguous areas on the map, the algorithm identified just two dialects: one associated with the big urban areas and one with everywhere else. Gonçalves & Sánchez write: “Superdialect α is utilized by speakers in main American and Spanish cities and corresponds to an international variety with a strongly urban component while superdialect β is comprised mostly of rural areas and small towns” (6). They see this as evidence for the homogenising effect of globalisation on language.
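
If you haven’t met K-means clustering before, the idea is simple enough to sketch in plain Python. This is emphatically not the authors’ pipeline: the place names and numbers below are invented toy data, each place is described only by the share of its tweets using two competing words, and the starting centroids are simply the first k points (real implementations usually choose them at random). The algorithm just alternates between assigning each place to its nearest centroid and moving each centroid to the average of its members.

```python
import math


def kmeans(points, k, iterations=10):
    """Very small K-means: returns a cluster index for each point."""
    # Assumption for reproducibility: start from the first k points.
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Step 1: assign every point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Step 2: move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


# Toy data: (share of word A, share of word B) per place. Invented numbers.
places = {
    "city_a": (0.9, 0.1),
    "town_c": (0.2, 0.8),
    "city_b": (0.8, 0.2),
    "town_d": (0.1, 0.9),
}
labels = kmeans(list(places.values()), k=2)
for name, label in zip(places, labels):
    print(name, label)
```

With this toy data the two cities end up in one cluster and the two towns in the other, which is the kind of urban versus rural split the paper reports, though on a vastly smaller scale.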

Eisenstein et al. (2014) focused not on the static facts of whole dialects but on fast-paced processes of change associated with new words entering the language. They collected a corpus of 107,000,000 tweets in English from 2009-2012 and looked only at words whose frequencies changed significantly over time. Below is an example, reproduced from their paper. It shows the expansion of the term ion (short for ‘I don’t’ as in ‘ion even care’) over a 150 week period.

Spread of the term ion over time, reproduced from Eisenstein et al. (2014).

One interesting finding which is immediately clear from such figures is that even for these sorts of words which are fundamentally written and exist (basically) only online, geography is relevant. On the face of it, we might expect words on the internet to spread randomly across space, as most of what is posted is publicly visible regardless of where you are. But the reality is that words basically spread through social networks, and these exist in real space, even if we’re watching them in action online.

Eisenstein et al. go on to examine the most common routes of linguistic diffusion, mapping the paths most often taken by new words between the cities, and then investigate what factors favour such linguistic pathways. They found that racial demographics were crucially important: linguistic differences were more likely to be transmitted between cities with similar proportions of African American citizens and Hispanic citizens. Small geographic distance, similar proportions of residents living in urbanised areas, and similar median income also facilitated linguistic influence. Population also had an effect: larger settlements were more likely to exert influence than be subject to it.

These two studies are just a small intimation of the potential for linguistic research with social media, but hopefully you can start to see what an exciting area this promises to be!

