Introduction to Data Science with R

background-image: url(figures/lopen.jpg)
background-position: center
background-size: cover

class: title-slide

.bg-text[
# Introduction to Data Science with R
### week.10

<hr />

November 14, 2019  
謝舒凱
]

---
# Introduction to Text Analytics

.large[Where we are]

---
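## If you have caught the bug, you can start looking for the position that suits you

- Use competition platforms such as `kaggle` to put yourself on the radar of international recruiters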

---
# Time to kick off the final project

- **Groups will be formed next week** / joint final showcase; try not to miss class from now on

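.large[As you start learning to carry out data science projects, also ask yourself:

- Is it environmentally friendly?

- How does it relate to people and society?

- Would a machine do it better than I can?
]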

???

Grouping?

---
# Of course, sports, leisure, and entertainment are also important parts of life

Sport analytics, e.g., [Animating NBA Play by Play using R](http://curleylab.psych.columbia.edu/nba.html)

---
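# So far in this course we have learned to think from the computer's point of view

- Current computing architectures need .red[explicit, step-by-step instructions]. What about people?
For example, how do you decide who is the tallest person in the picture?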

---
# Algorithmic Thinking

- So how do you think a computer decides?

```r
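# Start with the height of the first person.
# Assume this person is the "tallest person".
# Compare everyone else's height, one by one, with the current "tallest person".
# Whenever someone is taller than the current "tallest person", they become the new "tallest person".
# Once all comparisons are done, the tallest person has been found.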
```
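
The steps in the comments above can be turned into runnable R. This is only a minimal sketch: the names and heights are made up for illustration, and `find_tallest()` is a hypothetical helper, not a function from any package.

```r
# Hypothetical heights in cm; names and values are made up for illustration.
heights <- c(Amy = 162, Ben = 178, Cat = 171, Dan = 185, Eve = 169)

find_tallest <- function(h) {
  # Start with the first person and assume they are the tallest.
  tallest <- 1
  # Compare everyone else, one by one, with the current tallest.
  for (i in seq_along(h)[-1]) {
    if (h[i] > h[tallest]) {
      tallest <- i  # a taller person replaces the current tallest
    }
  }
  names(h)[tallest]
}

tallest_person <- find_tallest(heights)
print(tallest_person)
```

In everyday R one would probably write `names(which.max(heights))`; the loop is spelled out only to mirror the step-by-step instructions a computer needs.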

---
# Now let's learn to make the computer think from the human point of view

- AI and HI (Human Intelligence)

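- Computation, memory, `Perception` (vision, hearing, and so on), `Cognition` (insight, inference, planning, and decision-making), creativity and `Wisdom`, as well as `emotion`.  
  
- .large[**Language** (speech, text, sign language) and multimodal communication (gestures, body posture, and others) are among the most central and most complex parts.]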

---
## Example 1: Lying and lie detection are human skills

<span class="footnote"> The language of lying [Noah Zandan]()</span>

---
## Example 2: Emoticons are an expression of creativity in human communication

<img src = './figures/emoticon.jpeg' width="450" height="350px"></img>
[Emoji tracker](http://emojitracker.com/)

---
## Example 3: 'Often' is not only 'often'

Do you often say 【`常常;常不自覺;真的很不懂`】 ("often; often without realizing it; really don't get it")?

---
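# Text Analytics / Text Mining

.large[
- Language and cognition are too complex, so we start with text analytics.

- From a practical data science standpoint, text is currently still the most abundant source of data.]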

> The workflow is similar:
- Pre-processing
- Exploratory data analysis (statistical summary / graphical representation)
- [Linguistic annotation and analysis]
- Predictive modeling (regression, classification, clustering)
- Reproducible, infographic report with dynamic feedback (`Data <> Story`)
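
The first two steps of this workflow can be sketched in base R. The three sentences below are a toy corpus made up for illustration, and splitting on whitespace is a simplifying assumption that only works for languages written with spaces.

```r
# Toy corpus, made up for illustration.
docs <- c("Text mining is fun.",
          "Text analytics turns text into data!",
          "Data tells a story.")

# 1. Pre-processing: lower-case, strip punctuation, split on whitespace.
clean <- tolower(gsub("[[:punct:]]", "", docs))
words <- unlist(strsplit(clean, "\\s+"))

# 2. Exploratory analysis: a sorted word-frequency table.
freq <- sort(table(words), decreasing = TRUE)
print(head(freq, 3))
```

A call such as `barplot(head(freq, 5))` would then give a first graphical summary of the most frequent words.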

---
## `tidytext`: Text mining using dplyr, ggplot2, and other tidy tools

- Required reading for text analytics: [Tidy Text Mining with R](http://tidytextmining.com/)

- [Study more projects](https://markrstevenson.com/blog/wXMUdQCIfHqwTzrnProy)

---
## The three camps in (text) data science

(Text analytics | Text mining), (NLP | .red[Linguistics]), (Machine Learning | Statistics)

.large[
- Text analytics ( `$\simeq$` text mining) can be viewed as a set of **(computational) linguistic (NLP)** and **(statistical) machine learning** techniques that model and discover the information content of textual data for different purposes (e.g., business intelligence, research, or investigation).]

- Textual **data**, textual **information**, textual **knowledge**.
  - [Data Science] Linguistic/textual `data` processing
  - [Natural Language Processing] Linguistic/textual `information` processing
  - [Semantics, Ontologies, AI and Language Understanding] Linguistic/textual `knowledge` processing

---
# Linguistics and Text Analytics?

A linguistic look at 韓黑 ("Han haters")

![](figures/hanhei.png)

???

韓黑大學 ("Han Haters University")

---
# Tokenization
## Let's play a game first | Tracing of words [@sinclair2006linear]

* Text puzzle
  
    ```
    theheadmasterofharrowtellsannmcferranwhyhehasl
    etthetvcameraintoaschoolfullofodditiesbarnabylen
    on30thheadmasterofharrowschoolleansoverhisdesk 
    therearemoreimportantthingsinlifethanstrawboaters
    ```

---
## Answer
   
    ```
    the headmaster of harrow tells Ann Mcferran why he has 
    let the tv camera into a school full of oddities Barnaby Lenon 
    30th headmaster of harrow school leans over his desk there are more 
    important things in life than straw boaters
    ```
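
One simple way a program can attempt this puzzle is forward maximum matching: at each position, take the longest dictionary word that fits. The sketch below is a toy illustration under stated assumptions: `max_match()` is a hypothetical helper, the lexicon is a hand-picked word list rather than a real dictionary, and real segmenters use richer models.

```r
# Toy lexicon: an assumption for this example, not a real dictionary.
lexicon <- c("the", "head", "headmaster", "master", "of", "harrow",
             "tells", "tell", "he", "has", "why")

max_match <- function(s, lexicon) {
  out <- character(0)
  i <- 1
  n <- nchar(s)
  max_len <- max(nchar(lexicon))
  while (i <= n) {
    # Try the longest candidate first, then shrink it.
    len <- min(max_len, n - i + 1)
    while (len > 1 && !(substr(s, i, i + len - 1) %in% lexicon)) {
      len <- len - 1
    }
    # Falls back to a single character when nothing matches.
    out <- c(out, substr(s, i, i + len - 1))
    i <- i + len
  }
  out
}

segmented <- max_match("theheadmasterofharrowtells", lexicon)
print(segmented)
```

Because the method is greedy, it prefers `headmaster` over `head` + `master`. That choice happens to be right here, but the same greediness is what produces errors on ambiguous strings.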

---
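## Chinese word segmentation

- **Structural ambiguity** arises because there are no word boundaries.

- Examples (each string allows more than one segmentation):
  - *小花生了很久才出來*: `小花 / 生了` ("Xiaohua gave birth") vs. `小 / 花生` ("small peanut")
  - *阿里巴巴創辦人馬雲端上新服務*: `馬雲 / 端上` ("Jack Ma serves up") vs. `雲端` ("cloud")
  - *可是她可是網路上最紅的人欸*: the first `可是` means "but", the second is the emphatic `可 / 是` ("really is")

- At least two R packages: `Rwordseg` and `jiebaR`.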

---
# Chinese is not alone

```r
library(quanteda)  # provides tokens()
txt_jp <- "政治とは社会に対して全体的な影響を及ぼし、社会で生きるひとりひとりの人の人生にも様々な影響を及ぼす複雑な領域である。"
tokens(txt_jp)
```

```
## tokens from 1 document.
## text1 :
##  [1] "政治"         "と"           "は"           "社会"        
##  [5] "に対して"     "全体"         "的"           "な"          
##  [9] "影響"         "を"           "及"           "ぼ"          
## [13] "し"           "、"           "社会"         "で"          
## [17] "生きる"       "ひとりひとり" "の"           "人"          
## [21] "の"           "人生"         "に"           "も"          
## [25] "様々"         "な"           "影響"         "を"          
## [29] "及ぼす"       "複雑"         "な"           "領域"        
## [33] "で"           "ある"         "。"
```

---
background-image: url(../img/emo/boredom-small.png)
---
# Learn Lao and you will understand how hard Chinese is for foreign learners

```r
txt_lao <- "ຜູ້ອໍານວຍການໂຮງຮຽນໄດ້ບອກAnn Mcferranວ່າເປັນຫຍັງລາວຈຶ່ງປ່ອຍກ້ອງຖ່າຍຮູບທີວີເຂົ້າໂຮງຮຽນທີ່ເຕັມໄປ
ດ້ວຍຄວາມບໍ່ສະຫຼາດໃຈBarnaby Lenon ເປັນຜູ້ຈັດການໂຮງຮຽນມັດທະຍົມ 30 ປີ Leain ໃນໂຕະລາວມີສິ່ງທີ່ສໍາຄັນໃນຊີວິດຫຼາ"
tokens(txt_lao)
```

```
## tokens from 1 document.
## text1 :
##  [1] "ຜູ້"             "ອໍານວຍ"         "ການ"           "ໂຮງຮຽນ"       
##  [5] "ໄດ້"            "ບອກAnn"        "Mcferranວ່າເປັນ" "ຫຍັງ"          
##  [9] "ລາວ"           "ຈຶ່ງ"            "ປ່ອຍ"           "ກ້ອງ"          
## [13] "ຖ່າຍຮູບ"         "ທີ"             "ວີ"             "ເຂົ້າ"          
## [17] "ໂຮງຮຽນ"        "ທີ່"             "ເຕັມ"           "ໄປ"           
## [21] "ດ້ວຍ"           "ຄວາມ"          "ບໍ່"             "ສະຫຼາດ"        
## [25] "ໃຈBarnaby"     "Lenon"         "ເປັນ"           "ຜູ້ຈັດການ"       
## [29] "ໂຮງຮຽນ"        "ມັດທະຍົມ"        "30"            "ປີ"            
## [33] "Leain"         "ໃນ"            "ໂຕະ"           "ລາວ"          
## [37] "ມີ"             "ສິ່ງ"            "ທີ່"             "ສໍາ"           
## [41] "ຄັນ"            "ໃນ"            "ຊີວິດ"           "ຫຼາ"
```

---
## `jiebaR`

- First initialize the segmentation engine with `worker()`.

```r
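library(jiebaR)   # word segmentation engine
library(jiebaRD)  # dictionaries used by jiebaR
mixSeg <- worker()  # default worker: mixed-model segmenter
#hmmSeg <- worker(type = "hmm")
text2 <- "總有一天你會醒來，告訴我一切都是假的"
#segment(text2, mixSeg)
# or use the segmentation operator <=
mixSeg <= text2
# a file path can also be passed for segmentation
#segment(".\\data\\test.txt", mixSeg)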
```

---
## `jiebaR`: Customization

```r
mixSeg
# $user
# "/Library/Frameworks/R.framework/Versions/3.2/Resources/library
# /jiebaRD/dict/user.dict.utf8"
```

---
## `jiebaR`: POS tagging 
(POS tagset: [ICTCLAS Chinese POS tag set](http://www.cnblogs.com/chenbjin/p/4341930.html))

```r
pos.tagger <- worker("tag")
pos.tagger <= text2
```

```
##          l          r          v          v          v          r 
## "總有一天"       "你"       "會"     "醒來"     "告訴"       "我" 
##          i          n         uj 
## "一切都是"       "假"       "的"
```
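
As the output shows, the tagger returns a named character vector in which the values are the words and the names are the POS tags. A minimal sketch of working with such a result in base R follows. The vector is typed in by hand from the output above so that the snippet runs without `jiebaR`, and the variable names are illustrative.

```r
# Tagger output copied from the slide: values are words, names are POS tags.
tagged <- c(l = "總有一天", r = "你", v = "會", v = "醒來", v = "告訴",
            r = "我", i = "一切都是", n = "假", uj = "的")

# Keep only the tokens whose tag is "r" (pronoun).
pronouns <- unname(tagged[names(tagged) == "r"])
print(pronouns)

# Count how many tokens carry each tag.
tag_counts <- table(names(tagged))
print(tag_counts)
```

The same filtering pattern applies to any tag in the tagset, for example `"v"` for verbs.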

---
## `jiebaR`: Keyword Extraction and Similarity Calculation

- `Simhash` algorithm

```r
key.extract <- worker(type = "keywords", topn = 1)
key.extract <= text2
```

```
## 11.7392 
##    "假"
```

```r
sim <- worker(type = "simhash", topn = 2)
sim <= text2
```

```
## $simhash
## [1] "11853487723018994224"
## 
## $keyword
## 11.7392 11.7392 
##  "告訴"    "假"
```

---
## [In-class Exercise]: Lu Xun, *The True Story of Ah Q* (魯迅:阿 Q 正傳)

```r
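# read the raw text from Project Gutenberg, one element per line
luxun <- scan("http://www.gutenberg.org/files/25332/25332-0.txt",
                what = "character", sep = "\n")

# another lazy way: let gutenbergr download the text as a data frame
library(gutenbergr)
luxun <- gutenberg_download(25332)
# segment the text column (same result as `mixSeg <= luxun$text`)
luxun.seg <- segment(luxun$text, mixSeg)
# save the segmented tokens to a file
write.table(luxun.seg, 'luxun.txt')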
```

---
## In-class Exercise

(individual, 100 pt)
- (10 pt) Download a Chinese novel (other than 魯迅:阿 Q 正傳) from the Project Gutenberg website, then clean and preprocess the text (including using `jiebaR` to segment it). 
- (50 pt) Create a sorted word-frequency list. 
- (20 pt) Add a POS column (using `jiebaR` again) to the list and write it to a file.
- (20 pt) Extract all the **pronouns** (labeled as `r`), count the occurrences of each one separately, and present the counts as a table and a plot.

---
## Chinese words: do they exist?

- The wordhood assumption is still controversial

- (From character/morpheme and word to chunks/idiomatic expressions)

---
## Word segmentation and parsing of .green[「還在那邊」] 
### BEFORE

> 不是我要說他，已經什麼都有了他還在那邊邱什麼。別人我不敢說，這種人喔實在是 $#* :-(

### AFTER

```r
tokens("不是我要說他，已經什麼都有了他還在那邊邱什麼。別人我不敢說，這種人喔實在是 $#* :-( $#*")
```

```
## tokens from 1 document.
## text1 :
##  [1] "不是"   "我要"   "說"     "他"     "，"     "已經"   "什麼"  
##  [8] "都有"   "了"     "他"     "還在"   "那邊"   "邱"     "什麼"  
## [15] "。"     "別人"   "我"     "不敢"   "說"     "，"     "這種"  
## [22] "人"     "喔"     "實在是" "$"      "#"      "*"      ":"     
## [29] "-"      "("      "$"      "#"      "*"
```

---
## How do you say that someone is good-looking?

???
【耍帥】 ("act cool", with 帥 "handsome") is acceptable, but 【耍正】 (with 正 "pretty") is not

---
## Units of linguistic expression | From free expressions to fixed expressions

<span style= "font-size: 0.95em;">
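Character ("`愛`"), two-character word ("`愛情`"), three-character word ("`大不了`"), four-character pattern ("`沒大沒小`"), four-character idiom ("`一葉知秋`"), maxim ("`滿招損謙受益`"), two-part allegorical saying (歇後語), proverb, and so on.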
</span>