Introduction to Data Science with R

background-image: url(https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTdVuzJDXghejKmKbnlhzqE2SEez7J57fpQHXqE0FWIt_KS5FiJuw&s)
background-position: center
background-size: cover

class: title-slide

.bg-text[
# Introduction to Data Science with R
### week.13

<hr />

December 5, 2019  
謝舒凱
]

---
# R can do many other things

- For example, wishing someone a happy birthday: `birthday.R`

---
# R and Computational Music Analysis

Frequency of chord roots in the **McGill Billboard dataset**

![](./images/bbroot.png)

---
## Text and/to Melody

![](./images/keys.jpg)

E.g., why is [AI.Text2Melody](https://melobytes.com/en/app/melobytes) still hard?

---
## Review: the basic (introductory) text analysis workflow, illustrated
<img style='border: 1px solid;' width=90% src='./images/tm001.jpg'></img>
[source](https://manoharswamynathan.files.wordpress.com/2015/04/r-text-mining-001.jpg)

---
## Text analytics / Text mining flow

By now everyone should have a grasp of the first two items.

.large[
- **Preparing / Preprocessing / Representing textual data**. 
]
  - Collection and preprocessing: collecting raw text and preprocessing it,
  - Text representation: representing text using, e.g., Term Frequency-Inverse Document Frequency (TF-IDF) to weight how informative each word is in the texts,
  - Word (semantic) vectors (distributional and distributed semantics)
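
The TF-IDF representation mentioned above can be sketched in base R. This is a minimal illustration, assuming a toy three-document corpus, whitespace tokenization, raw counts for TF, and the natural-log IDF variant (one of several common weightings); none of these choices come from the course materials.

```r
# Toy corpus (hypothetical documents, for illustration only)
docs <- c(d1 = "the cat sat on the mat",
          d2 = "the dog sat on the log",
          d3 = "the cat chased the dog")

# Tokenize on whitespace after lowercasing
tokens <- strsplit(tolower(docs), "\\s+")
vocab  <- sort(unique(unlist(tokens)))

# Term frequency: raw counts, one row per document, one column per term
tf <- t(sapply(tokens, function(x) tabulate(match(x, vocab), nbins = length(vocab))))
colnames(tf) <- vocab

# Inverse document frequency: log(N / number of documents containing the term)
idf <- log(length(docs) / colSums(tf > 0))

# TF-IDF: scale each term's column by its IDF
tfidf <- sweep(tf, 2, idf, `*`)
round(tfidf, 2)
```

A word such as "the", which occurs in every document, gets weight 0, while a word confined to one document gets the highest weight.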

.large[
- **Exploratory data analysis and Infographics** 
]
  - Data visualization for the purpose of discovery: looking for groups in the data, finding outliers, and identifying common dimensions, patterns, and trends.

- **Prediction models** (regression, classification, and clustering) and evaluation. Recommender systems, collaborative filtering, association rules, optimization methods based on linguistic heuristics, and a myriad of methods for regression, classification, and clustering all fall under the rubric of machine learning.

---
background-image: url(../img/emo/boredom-small.png)
---
## Preprocessing review

![](images/tm.png)
[Blog post by the TA](https://yongfu.name/2018/07/28/quanteda-tutorial.html)

---
## From preprocessing to EDA
[Let's walk through the quanteda tutorial together](https://tutorials.quanteda.io)

![](images/quanteda.png)

---
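## Textual EDA: how do we plot text data?

- We want to use visualization techniques to mine text for information, trends, and changing patterns. For example:
  - The behavior and social networks of PTT users as reflected in a PTT corpus
  - The writing styles of different authors
  - Comparing political views, positions, and values (before and after an election)

- Basic options
  - Word clouds and their comparison
  - Correlation plots and word/phrase trees
  - Social networks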

---
## Word Cloud

- A word cloud is a graphical representation in which the font size of each word corresponds to its frequency relative to the others: the bigger the word, the higher its frequency.

- Packages such as `wordcloud2` can draw one; `RColorBrewer` supplies the color palettes.
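
A minimal sketch of the data behind a word cloud: a table of words and their counts. The text here is a made-up toy string, and the plotting step runs only if `wordcloud2` happens to be installed.

```r
text <- "data science with r text mining with r text analysis in r"  # toy text
tokens <- strsplit(tolower(text), "\\s+")[[1]]

# Frequency table, most frequent first
counts <- sort(table(tokens), decreasing = TRUE)
freq <- data.frame(word = names(counts), freq = as.integer(counts),
                   stringsAsFactors = FALSE)
head(freq)

# Build the cloud only if wordcloud2 is installed; print `cloud` to display it
if (requireNamespace("wordcloud2", quietly = TRUE)) {
  cloud <- wordcloud2::wordcloud2(freq)
}
```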

---
## Word clouds can also be compared

- To construct a **comparison cloud**, the data must be in the form of a term matrix. The `tm` package provides the `TermDocumentMatrix()` function, which constructs a term-document matrix:

```r
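library(wordcloud)  # comparison.cloud(); also attaches RColorBrewer for brewer.pal()
# `data` is assumed to be a plain term-by-document matrix with one column per speaker
colnames(data) <- c("bush", "obama")
comparison.cloud(data, max.words = 250, title.size = 2, colors = brewer.pal(3, "Set1"))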
```
<img src="images/cloud.jpg" alt="Drawing" style="width: 400px;"/>
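
To make the required input shape concrete, here is a minimal sketch that builds a small term-by-document matrix in base R. The two "speeches" are invented toy strings, and the plot is drawn only if the `wordcloud` package is installed.

```r
# Two toy "speeches" (hypothetical text, for illustration only)
docs <- c(bush  = "freedom security freedom nation tax",
          obama = "health care change nation health")

tokens <- strsplit(tolower(docs), "\\s+")
vocab  <- sort(unique(unlist(tokens)))

# Terms in rows, documents in columns: the shape comparison.cloud() expects
tdm <- sapply(tokens, function(x) tabulate(match(x, vocab), nbins = length(vocab)))
rownames(tdm) <- vocab
tdm

# Plot only if the wordcloud package is installed
if (requireNamespace("wordcloud", quietly = TRUE)) {
  wordcloud::comparison.cloud(tdm, max.words = 250, title.size = 2,
                              colors = c("red", "blue"))
}
```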

---
## Word tree (wordTree)

`googleVis: R Interface to Google Charts`
- A phrase tree or word tree gives useful insight into a text because it shows the context in which words occur, not just their frequency. <https://www.jasondavies.com/wordtree/>

```r
library(googleVis)
wt1 <- gvisWordTree(Cats, textvar = "Phrase")
plot(wt1)
```

---
## Motion charts help when tracking how data change over time

---
## With some creativity, big language data can reveal a lot

[Google Books Ngram Viewer](https://books.google.com/ngrams/)

---
## Try it in R

- [`ngramr`](https://github.com/seancarmody/ngramr): R package to query the Google Ngram Viewer

```r
library(ngramr) # install locally!
library(ggplot2)
ggram(c("monarchy", "democracy"), year_start = 1500, year_end = 2000, 
      corpus = "eng_gb_2012", ignore_case = TRUE, 
      geom = "area", geom_options = list(position = "stack")) + 
      labs(y = NULL)
```

![](index_files/figure-html/unnamed-chunk-3-1.png)

---
## How should we interpret this plot?

---
## Reminders from a linguistic perspective

.large[
- Use stop word (`stop words`) removal with caution

- The concept and application of a corpus (`corpus`) are not as simple as they seem
]
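
One way to see why stop word removal needs care is the following minimal sketch. It assumes a tiny hand-made stop list (real lists are much longer) and two invented example sentences.

```r
# A tiny hypothetical stop word list, for illustration only
stop_words <- c("to", "be", "or", "not", "the", "is", "a", "this")

remove_stop_words <- function(text, stops) {
  tokens <- strsplit(tolower(text), "\\s+")[[1]]
  tokens[!tokens %in% stops]
}

# Every token is a stop word, so nothing of the sentence survives
remove_stop_words("to be or not to be", stop_words)

# Only "movie" and "good" survive: the negation is lost
remove_stop_words("this movie is not good", stop_words)
```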

---
## Language Resources

- Corpora
- Lexicons / lexical (knowledge) resources
- Ontologies

---
## Corpus: concepts

- A corpus is foundational infrastructure for natural language processing (NLP) and text analysis: a large collection of texts used for various purposes in NLP.

- Annotation is the core, and it is linguistic in nature.

> Good annotations support good applications

---
## Corpus: tools

These tools typically provide the following functions:

- Corpus building and indexing
- Concordance
- Frequency list
- (Grammatical) Collocations (and colligations)
- Keywords
- Thesaurus
- ngram
- Visualization
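
Three of the functions listed above (frequency list, n-grams, concordance) can be sketched in a few lines of base R. This is a toy illustration on an invented sentence, with hypothetical helper names `ngrams()` and `kwic()`; real corpus tools add indexing, annotation, and statistics on top.

```r
text <- "the cat sat on the mat and the dog sat on the log"  # toy corpus
tokens <- strsplit(text, "\\s+")[[1]]

# Frequency list: token counts, most frequent first
freq_list <- sort(table(tokens), decreasing = TRUE)

# n-grams: every run of n consecutive tokens
ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts, function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}
bigrams <- ngrams(tokens, 2)

# Concordance (keyword in context): each hit with up to `window` tokens per side
kwic <- function(tokens, keyword, window = 2) {
  hits <- which(tokens == keyword)
  vapply(hits, function(i) {
    left  <- tokens[seq(max(1, i - window), length.out = min(window, i - 1))]
    right <- tokens[seq(i + 1, length.out = min(window, length(tokens) - i))]
    sprintf("%s [%s] %s", paste(left, collapse = " "), keyword,
            paste(right, collapse = " "))
  }, character(1))
}
kwic(tokens, "sat")
```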

---
## Corpus: tools

[Voyant](http://voyant-tools.org/?corpus=7fda0cccc3e3da40ce4f6b5c38347689)
<img style='border: 1px solid;' width=90% src='./images/voyant.png'></img>
[Word Sketch Engine](https://www.sketchengine.eu)