On Utilizing Parallel Sentence Corpora

Most of us have relied on a dictionary when working on an essay or writing assignment.  In my studies in Chinese I have often relied on the dictionary app Pleco to complete my assignments.  But sometimes we come to see the limitations of what a dictionary can do.

For example, the word “lost” can have a range of meanings in English:
“I lost my wallet.” vs “I lost money in the stock market.” vs “I lost the competition.”

If you look up the word “lost” in a dictionary, you might find:
失去的 or 丢失的

If you hastily select one of these words you may fail to realize that this word in Chinese does not have the same range of meaning as it does in English.  For example, the words above cannot be used in the sentence “I lost the competition”.  This is where utilizing parallel sentence corpora can be very helpful.

What are parallel sentence corpora?
Parallel sentence corpora, or parallel text corpora, are large databases that store both original sentences and their translations from various sources (e.g. textbooks, popular books, songs).  By searching these databases, you can see how words are used in authentic ways.  For example:

http://jukuu.com/search.php?q=lost+wallet
http://jukuu.com/search.php?q=lost+stock+market
http://jukuu.com/search.php?q=lost+competition
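The three search links above follow one pattern, so a link for any phrase can be built the same way. Below is a minimal sketch in Python; the function name `jukuu_search_url` is made up for illustration, and the only thing taken from the links is the `search.php?q=` form of the address.

```python
from urllib.parse import urlencode

# Base address copied from the example links above.
JUKUU_SEARCH = "http://jukuu.com/search.php"

def jukuu_search_url(phrase: str) -> str:
    """Build a search link for an English phrase (hypothetical helper)."""
    # urlencode turns spaces into "+" and escapes other special characters.
    return JUKUU_SEARCH + "?" + urlencode({"q": phrase})

for phrase in ["lost wallet", "lost stock market", "lost competition"]:
    print(jukuu_search_url(phrase))
```

Running it prints the same three links as above, one per line.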

Jukuu is primarily for English/Chinese and English/Japanese sentences.  Tatoeba is another website that has accumulated millions of sentences (and their translations) across 200+ languages.


Fixed Mindset vs. Growth Mindset

Many fail to realize that our attitudes and beliefs about language learning, and learning in general, can greatly influence our outcomes and success.

Carol Dweck, a psychologist at Stanford, is best known for her work on two types of attitudes or thinking: a fixed mindset and a growth mindset.  In a fixed mindset, people believe that qualities such as intelligence or talent are fixed traits.  In a growth mindset, people believe that these qualities can be developed and increased.  Below are a few more examples of how this plays out:

Aspect | Fixed Mindset | Growth Mindset
Skills | Fixed; something you’re born with | Can be improved; developed through hard work
Challenges | Avoid; give up easily | Embrace; persist when things get tough
Effort | Unnecessary and fruitless | Essential path to mastery
Criticism | Ignore; get defensive | Useful; opportunity to learn
Success of Others | Threatened | Inspired
Result | Premature plateau | Attain higher levels of success

What Does It Mean For Me?
Fortunately the conclusion from this research is that we can change the narrative we tell ourselves.  It is not necessarily the case that if you put your mind to it, you can do anything.  But when we cultivate passion and commit to stretching our existing abilities, we may just end up surprising ourselves.

In my personal journey to learn Chinese, I have taken inspiration from watching other foreigners who have attained very high levels.  I tell myself if they were able to get there, then there’s nothing stopping me from getting there either:
Julien Godfrey (朱利安)
Adventurina King (金小鱼)

Be sure to check out her book:

On Estimating Language Difficulty and Length of Study

If you’ve ever wondered how different languages stack up in terms of relative difficulty, you might have come across this infographic put out by Voxy:

[Infographic: Hard Languages to Learn]

The majority of the data for this infographic comes from research by the Foreign Service Institute (FSI), a training school of the U.S. State Department.  While these findings are interesting, we should understand the limitations of the data and how they apply to our own situation:

  • The students at FSI are future diplomats or government interpreters who are paid to study full-time.  FSI reports that its programs require at least four class hours a day (usually more), five days a week, plus three or more additional hours a day of independent study (Jackson & Kaplan, 2001).  The intensity and motivation of study can therefore be quite different from our own learning context.
  • This data is based on estimates taken from native English speakers.  This means we should be careful in making inferences about non-native speakers of English learning one of these languages.
  • The estimated length of time is to attain ILR Level 3 (which is described as “professional working proficiency”), but not the time it takes to reach native or bilingual proficiency, ILR Level 5.
  • The mean age of FSI students is 41, and most of them go on to achieve or exceed their learning goals.  This is quite encouraging for those of us who feel we might have missed the window of opportunity for language learning after our teenage years.

REFERENCE
Jackson, F. H., & Kaplan, M. A. (2001). Lessons learned from fifty years of theory and practice in government language teaching. In J. E. Alatis & A.-H. Tan (Eds.), Georgetown University Round Table on Languages and Linguistics 1999: Language in Our Time (pp. 71–87). Washington, DC: Georgetown University Press.

Memory and Spaced Repetition System

Having a deeper understanding of how our brain works can really help us optimize our language learning approaches.  This is an area of study called metacognition, or understanding our cognitive processes and how our brain works.  For example, if we understand how memory works, and specifically how long-term memory works, we can improve our learning processes.

The graph below shows a concept that all of us are familiar with if we’ve ever tried to retain someone’s phone number in our short-term memory (maybe just long enough to write it down). The same applies to language learning. Anytime you learn a new fact, you have a pretty good chance of remembering it soon after.  But without review, a few days later, your chance of remembering that new fact drops close to zero. We need regular review to consolidate the new facts.  Yet what the graph also shows is that each time you review that new fact, you are able to remember it for longer. This may seem obvious to you, but many people fail to take advantage of this phenomenon.

[Graph: the forgetting curve]

Enter the Spaced Repetition System
The implication of the truth above is that each subsequent review of the new fact can be postponed further and further into the future.

Let’s take an example: You are trying to learn 100 new vocabulary words.  You want to use flashcards to commit them to memory.

Option A: You tediously flip through all 100 vocabulary flashcards.  You keep doing this each day until you have memorized all 100 vocabulary words.

Option B: After studying the 100 flashcards for a few days, you quickly realize that some words are very easy to recall and some are still very hard to recall.  You choose to set aside the easy flashcards and review them a week from now.  Then you focus only on the hard words for the next couple of days.

Here is a good video that explains this concept further: https://www.youtube.com/watch?v=Ay_zUGnreWQ

In the example above, the most efficient use of your time is to only review specific flashcards just as you are in danger of forgetting them.  You can manually set up a system (e.g. Leitner System) or you can allow technology to figure out the SRS for you.
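To make the idea concrete, here is a minimal sketch of a Leitner-style scheduler in Python. It is an illustration only, not how any particular app works: the `Card` class, the function names, and the review intervals (1, 3, 7, 14 and 30 days) are all assumptions chosen for the example.

```python
from dataclasses import dataclass

# Days to wait before reviewing a card in each box (illustrative values only).
INTERVALS = [1, 3, 7, 14, 30]

@dataclass
class Card:
    front: str
    back: str
    box: int = 0      # index into INTERVALS; box 0 is reviewed most often
    due_day: int = 0  # day number on which the card should next be reviewed

def review(card: Card, remembered: bool, today: int) -> None:
    """Update a card after one review."""
    if remembered:
        # Easy cards move up a box, so the next review is further away.
        card.box = min(card.box + 1, len(INTERVALS) - 1)
    else:
        # Forgotten cards go back to the first box for frequent review.
        card.box = 0
    card.due_day = today + INTERVALS[card.box]

def due_cards(cards: list[Card], today: int) -> list[Card]:
    """Return only the cards that need reviewing today."""
    return [c for c in cards if c.due_day <= today]

deck = [Card("大方", "generous"), Card("天真", "naive, innocent")]
review(deck[0], remembered=True, today=0)   # easy card: next review on day 3
review(deck[1], remembered=False, today=0)  # hard card: next review on day 1
print([c.front for c in due_cards(deck, today=1)])  # ['天真']
```

On day 1 only the hard card comes up again, which is exactly the behaviour of Option B: easy cards are set aside for longer and hard cards are seen sooner.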

There are various flashcard apps and software on the market.  Just make sure that whichever one you choose uses SRS.  Here are a few recommendations:

Universal Flashcard Apps
Anki
SuperMemo
Mnemosyne

Language Specific Flashcard Apps
Skritter
Memrise
Duolingo

How Many Chinese Characters Do I Really Need to Learn?

Depending on what sources you look at, there are 50,000 to 100,000 Chinese characters that exist in the Chinese language.  For someone learning Chinese, learning that many characters seems virtually impossible.

Fortunately, most of these characters are seldom used.  The Ministry of Education of China compiled the 现代汉语通用字表 (List of Commonly Used Characters in Modern Chinese) which lists 7,000 of the most common Chinese characters.  But learning 7,000 characters is still a huge task.

Do we really need to learn 7,000 characters?
Whether you pick up a newspaper, magazine, or novel, you will be able to recognize a certain percentage of the text with even a limited knowledge of characters.  You don’t have to master 7,000 characters, let alone slavishly study 50,000.  The key is to focus on the most frequent characters first.

What if I know 500, 1000, 2000 characters?
There are a few studies that have looked at character frequency and word frequency across vast amounts of text (Huang, Da, Purohit).

The results are as follows.  If you know 500 characters, you will be able to read roughly 79% of any given text.  If you know 1,000 characters, you will be able to read roughly 91% of any given text. And by 3,000 characters, you will be able to read over 99% of any given text.  So it seems there is no need to study all 7,000 characters for everyday life.

Characters Huang Da Purohit Average
250 64.4% 57.1% 68.0% 63.2%
500 79.2% 72.1% 87.0% 79.4%
1,000 91.1% 86.2% 96.2% 91.2%
1,500 95.7% 92.4% 99.0% 95.7%
2,000 97.9% 95.6% 99.6% 97.7%
3,000 99.4% 98.3% 99.9% 99.2%
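Coverage figures like these come from counting how often each character appears in a large body of text and adding up the share taken by the most frequent ones. The Python sketch below shows the calculation on a tiny sample sentence. It is not the method used by Huang, Da or Purohit, whose corpora are far larger; the function name and the sample text are made up for illustration.

```python
from collections import Counter

def coverage(text: str, top_n: int) -> float:
    """Share of the Chinese characters in `text` covered by its top_n most frequent characters."""
    # Count only characters in the main CJK block, which skips punctuation.
    counts = Counter(ch for ch in text if "\u4e00" <= ch <= "\u9fff")
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(n for _, n in counts.most_common(top_n))
    return covered / total

sample = "我的钱包丢了。我的朋友也丢了钱包。"
print(f"{coverage(sample, 6):.1%}")  # 80.0%
```

In this sample, six characters appear twice and three appear once, so knowing the six most frequent characters covers 12 of the 15 characters in the text.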


Is that the Whole Story?
Unfortunately this is not the whole story.  When studying Chinese, you quickly realize that focusing solely on characters is insufficient, since words made from a combination of characters may take on new meaning.  For example:

大 = big
方 = square; direction; place

but together 大方 = generous

天 = sky
真 = real, true

but together 天真 = naive, innocent

What learners really need to pay attention to is learning words.  According to Purohit, the massive dataset he analyzed contained 3,848 characters, but these characters combined in various ways to make 26,767 words.  Further, his analysis shows what percentage of any given text you can read based on your vocabulary size.  For example, if you know 1,319 words, you would be able to read 80% of any given text.  If you know 2,801 words, you would be able to read 90% of any given text.

Words Percentage
230 50.0%
1,319 80.0%
2,801 90.0%
5,000 95.1%
12,054 99.0%
21,875 99.9%
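The table answers the reverse question: how many of the most frequent words do you need to reach a given coverage? A minimal Python sketch of that calculation is below. It assumes the text has already been split into words, and the function name and the toy word list are made up for illustration; it is not Purohit’s own code.

```python
from collections import Counter

def words_needed(tokens: list[str], target: float) -> int:
    """Smallest number of most-frequent words whose occurrences reach `target` coverage."""
    counts = Counter(tokens)
    total = sum(counts.values())
    running = 0
    # Walk down the frequency list, adding up occurrences until the target is met.
    for rank, (_, n) in enumerate(counts.most_common(), start=1):
        running += n
        if running / total >= target:
            return rank
    return len(counts)

tokens = ["我", "的", "钱包", "丢", "了", "我", "的", "朋友", "也", "丢", "了", "钱包"]
print(words_needed(tokens, 0.5))  # 3
print(words_needed(tokens, 0.8))  # 5
```

Even in this toy example, the last few percent of coverage cost the most: three words cover half of the text, but all seven are needed to cover it completely.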


How Many Characters Do I Know Already?
There are a few online character recognition quizzes that can give an estimate, but it is a rough estimate:
Clavis Sinica Chinese Character Test
How many Chinese characters do you know?

If you have studied a specific textbook or prepared for the HSK, this can also give you a rough idea.  For example, after completing New Practical Chinese Reader 1 (NPCR 1) you will have learned 429 characters and 456 words.

Characters Words
NPCR 1 429 456
NPCR 2 807 1,059
NPCR 3 1,143 1,660
NPCR 4 1,423 2,337

Or after learning the HSK vocabulary:

Characters Words
HSK 1 174 150
HSK 2 300 347
HSK 3 617 600
HSK 4 1,064 1,200
HSK 5 1,685 2,500
HSK 6 2,663 5,000

What is the best way to study these characters and words?
The most efficient way to study is to use apps or software that employ Spaced Repetition Systems (SRS), such as Skritter, Pleco’s flashcards, or Anki.

When I first started studying Chinese, I committed to practicing Chinese characters and words for 30 minutes a day on Skritter.  Over the span of my first year, with the help of SRS, I learned over 2,000 characters and 4,000 words.


REFERENCES
https://puroh.it/how-many-chinese-characters-and-words-are-in-use/
http://lingua.mtsu.edu/chinese-computing
http://www.yellowbridge.com/chinese/topchars.php
http://chinese.stackexchange.com/questions/36/how-many-characters-do-i-need-to-learn

Chinese Text Analyser

I want to highlight a great resource for studying Chinese: Chinese Text Analyser.


As a language learner you should already be taking advantage of the benefits of extensive reading. But one major requirement of extensive reading is selecting material that is at just the right level for you (i.e. roughly 1 unknown word for every 50 words).

Fortunately, various reading resources, such as graded readers, have become available in recent years.  It has become easier for learners to find books that are at their current level (e.g. novice, intermediate, advanced).

But what if you wanted to read something beyond graded readers or textbooks?  Maybe Harry Potter in Chinese or the Chinese Bible? How could you assess whether a novel is within reach or completely beyond the grasp of your current level of Chinese?

Chinese Text Analyser helps fill this gap.  The software, which is available for both PC and Mac, is a tool that helps you analyze Chinese text.  In seconds it can process and segment various text files to give you statistics on characters and words.  You tell the software which characters and words you are familiar with, and the software will highlight the unknown characters and words in red.  The software also lets you export lists of words.

For example, if after studying the HSK 6 vocab you were keen on reading Harry Potter, you could simply ask Chinese Text Analyser to highlight all the words in Harry Potter that are not part of the HSK 1-6 vocab.  You could even export this list of unknown words to import into your favorite SRS flashcard app.
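The general idea of splitting a text into words and listing the ones you don’t know can be sketched in a few lines of Python. This is only an illustration and not how Chinese Text Analyser works internally: the greedy longest-match segmenter, the tiny `lexicon` and `known` sets, and the function names are all assumptions made for the example.

```python
def segment(text: str, lexicon: set[str], max_len: int = 4) -> list[str]:
    """Greedy forward longest-match segmentation; unmatched characters become single tokens."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first, then shrink down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens

def unknown_words(tokens: list[str], known: set[str]) -> list[str]:
    """Unique Chinese tokens not in `known`, in order of first appearance."""
    seen, result = set(), []
    for tok in tokens:
        is_chinese = all("\u4e00" <= ch <= "\u9fff" for ch in tok)
        if is_chinese and tok not in known and tok not in seen:
            seen.add(tok)
            result.append(tok)
    return result

lexicon = {"哈利", "魔法", "学校", "喜欢", "他", "的"}  # toy dictionary
known = {"他", "喜欢", "的", "学校"}                    # words the learner knows
tokens = segment("哈利喜欢他的魔法学校。", lexicon)
print(unknown_words(tokens, known))  # ['哈利', '魔法']
```

The resulting list could then be written out one word per line, which is a format most flashcard apps can import.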

Here is a sample of some books I have analyzed with Chinese Text Analyser:

Fiction

Title 名字 Unique Characters Unique Words
Fantastic Mr Fox 了不起的狐狸爸爸 1,227 1,927
Little House on the Prairie 草原上的小木屋 1,398 2,479
The Little Prince 小王子 1,496 2,572
The House on Mango Street 芒果街上的小屋 1,859 3,495
Grandma in the Apple Tree 苹果树上的外婆 1,875 3,670
Charlie and the Chocolate Factory 查理和巧克力工厂 2,038 4,227
To Live 活着 1,881 4,361
Charlotte’s Web 夏洛的网 2,112 4,445
Chronicles of Narnia – Book 1 纳尼亚传奇 – 狮子女巫魔衣橱 2,149 4,827
Peter Pan 彼得潘 2,505 6,709
Wolf King Dream 狼王梦 2,826 7,457
Harry Potter – Book 1 哈利波特与魔法石 2,582 7,738
Hunger Games – Book 1 饥饿游戏1 2,728 8,458
Anne of Green Gables 绿山墙的安妮 2,751 8,879
Twilight – Book 1 暮光之城1 2,818 9,472

Non-Fiction

Title 名字 Unique Characters Unique Words
Power of Habit 习惯的力量 1,775 3,871
Tipping Point 引爆点 1,797 4,451
5 Love Languages 爱的五种语言 1,942 4,739
Country Driving 寻路中国 2,387 6,918
Knowing God 认识神 2,482 7,609
River Town 江城 2,911 10,760

Bible

Title 名字 Unique Characters Unique Words
Bible 圣经 3,001 10,562
Old Testament 旧约 2,903 9,430
New Testament 新约 2,209 6,169
Genesis 创世记 1,502 2,723
Exodus 出埃及记 1,424 2,347
Leviticus 利未记 1,095 1,608
Numbers 民数记 1,303 2,135
Deuteronomy 申命记 1,433 2,341
Joshua 约书亚记 1,027 1,565
Judges 士师记 1,242 1,881
Ruth 路得记 490 555
1 Samuel 撒母耳记上 1,330 2,172
2 Samuel 撒母耳记下 1,338 2,141
1 Kings 列王记上 1,314 2,136
2 Kings 列王记下 1,284 2,063
1 Chronicles 历代志上 1,193 1,845
2 Chronicles 历代志下 1,381 2,332
Ezra 以斯拉记 846 1,137
Nehemiah 尼希米记 1,039 1,470
Esther 以斯帖记 727 907
Job 约伯记 1,577 2,689
Psalms 诗篇 1,755 3,449
Proverbs 箴言 1,441 2,315
Ecclesiastes 传道书 842 1,123
Song of Solomon 雅歌 683 743
Isaiah 以赛亚书 1,905 3,665
Jeremiah 耶利米书 1,679 3,095
Lamentations 耶利米哀歌 802 907
Ezekiel 以西结书 1,593 2,833
Daniel 但以理书 1,098 1,634
Hosea 何西阿书 954 1,180
Joel 约珥书 587 594
Amos 阿摩司书 864 990
Obadiah 俄巴底亚书 275 239
Jonah 约拿书 411 398
Micah 弥迦书 795 898
Nahum 那鸿书 575 543
Habakkuk 哈巴谷书 585 573
Zephaniah 西番雅书 535 524
Haggai 哈该书 335 313
Zechariah 撒迦利亚书 930 1,167
Malachi 玛拉基书 518 507
Matthew 马太福音 1,420 2,461
Mark 马可福音 1,224 1,921
Luke 路加福音 1,493 2,701
John 约翰福音 1,011 1,638
Acts 使徒行传 1,368 2,560
Romans 罗马书 969 1,498
1 Corinthians 哥林多前书 956 1,437
2 Corinthians 哥林多后书 824 1,166
Galatians 加拉太书 605 763
Ephesians 以弗所书 653 810
Philippians 腓立比书 545 638
Colossians 歌罗西书 559 634
1 Thessalonians 帖撒罗尼迦前书 485 536
2 Thessalonians 帖撒罗尼迦后书 362 357
1 Timothy 提摩太前书 628 764
2 Timothy 提摩太后书 564 614
Titus 提多书 418 426
Philemon 腓利门书 226 211
Hebrews 希伯来书 980 1,430
James 雅各书 633 701
1 Peter 彼得前书 658 765
2 Peter 彼得后书 522 573
1 John 约翰一书 336 400
2 John 约翰二书 155 140
3 John 约翰三书 171 158
Jude 犹大书 358 322
Revelation 启示录 1,069 1,557

Benefits of Extensive Reading


Extensive reading is an approach to language learning that emphasizes reading large amounts of comprehensible text.  Both intensive reading and extensive reading are important, but language learners often end up spending the majority of their time doing intensive reading.

What is extensive reading?
– Reading is for developing fluency and overall understanding
– The text is longer (e.g. short story, novel)
– The learner selects what is enjoyable and interesting
– The text is relatively easy
– Mostly performed out of class

How does it differ from intensive reading?
– Reading is for a specific focus (e.g. grammar, vocabulary)
– Often short texts followed by comprehension questions
– The text is usually selected by the teacher or is the same for all learners (e.g. textbook)
– The text is usually quite difficult
– Often performed in class

What is comprehensible or easy text?
Most researchers specifically cite the need for the learner to find text that is 98% comprehensible.  In other words, in any given paragraph or page there is only 1 unknown word in every 50 words.  When a learner finds text at this sweet spot, they can enjoy the reading process without being slowed down by having to look up many words in the dictionary.
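The 98% guideline is simple arithmetic: 1 unknown word in 50 leaves 49 known words, and 49 / 50 = 98%. The Python sketch below checks a text against that threshold. It assumes the text is already split into words, and the function names are made up for illustration.

```python
def comprehension(tokens: list[str], known: set[str]) -> float:
    """Fraction of the running words in a text that the reader already knows."""
    if not tokens:
        return 0.0
    return sum(tok in known for tok in tokens) / len(tokens)

def is_extensive_reading_level(tokens: list[str], known: set[str],
                               threshold: float = 0.98) -> bool:
    """True if the text meets the 98% comprehensibility guideline."""
    return comprehension(tokens, known) >= threshold

# 50 running words containing exactly 1 unknown word: 49 / 50 = 98%.
page = ["好"] * 49 + ["难"]
print(comprehension(page, {"好"}))               # 0.98
print(is_extensive_reading_level(page, {"好"}))  # True
```

A page with 2 unknown words in 50 would come out at 96% and fall below the threshold.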

What are the benefits of extensive reading?
“Good things happen to students who read a great deal in the foreign language. Research studies show they become better and more confident readers, they write better, their listening and speaking abilities improve, and their vocabularies get richer. In addition, they develop positive attitudes toward and increased motivation to study the new language.” (Bamford & Day, 2004, p. 1)

Where can I learn more about extensive reading?
The Extensive Reading Foundation has a great guide to extensive reading (available in English, Japanese, Korean, Spanish, Farsi, Traditional Chinese, Vietnamese, and Arabic):
Extensive Reading Guide

Extensive Reading Resources:
Resources for Chinese

REFERENCES
Bamford J., Day R.R. (eds) (2004). Extensive Reading Activities for Language Teaching. New York: Cambridge University Press.

Reading Resources for Chinese

There are many great reading resources for learners of Chinese at various proficiency levels.

Graded Readers
These books are specifically created for language learners. Each book is written for a specific level (e.g. beginner, intermediate, advanced) and employs limited vocabulary and simplified grammar.

Chinese Breeze Series (汉语风)
(Level 1-4) (300/500/750/1100 Unique Words)

Friends Chinese Graded Reader (好朋友汉语分级读物)
(Level 1-6) (150/300/600/1200/2500/5000 Unique Words)

Mandarin Companion (Level 1-2)
(300/450 Unique Characters)

Rainbow Bridge Graded Chinese Reader (彩虹桥汉语分级读物)
(Level 1-6) (150/300/750/1000/1500/2500 Unique Words)

Learn Chinese Graded Reader (汉语分级读物)
(Level 1-3) (500/800/1200 Unique Characters)


Skills Books for Reading

These are textbooks specifically created to help develop your reading skills such as reading speed and reading comprehension.

Short-Term Reading Chinese (汉语阅读速成)

Read This Way (这样阅读)

Native Material
Once you reach a high enough proficiency level, native material can also become accessible.  Particularly, books or stories that are geared towards primary or middle school reading may be easy enough for language learners.

Fiction

Title 名字 Unique Characters Unique Words
Fantastic Mr Fox 了不起的狐狸爸爸 1227 1927
Little House on the Prairie 草原上的小木屋 1398 2479
The Little Prince 小王子 1496 2572
The House on Mango Street 芒果街上的小屋 1859 3495
Grandma in the Apple Tree 苹果树上的外婆 1875 3670
Charlie and the Chocolate Factory 查理和巧克力工厂 2038 4227
To Live 活着 1881 4361
Charlotte’s Web 夏洛的网 2112 4445
Chronicles of Narnia – Book 1 纳尼亚传奇 – 狮子女巫魔衣橱 2149 4827
Peter Pan 彼得潘 2505 6709
Wolf King Dream 狼王梦 2826 7457
Harry Potter – Book 1 哈利波特与魔法石 2582 7738
Hunger Games – Book 1 饥饿游戏1 2728 8458
Anne of Green Gables 绿山墙的安妮 2751 8879
Twilight – Book 1 暮光之城1 2818 9472