Regression
流浪汪星人 怎麼做 data package?
https://data.gov.tw/dataset/25602 98年度台灣地區各縣市流浪狗數調查結果?
https://animal.coa.gov.tw/html/index_06_0621_dog.html
computerR
: 各縣市平均每100個家庭的電腦數目
graduate
: 各縣市研究所畢業者人數
require(dplyr)
require(ggplot2)
#https://www.dropbox.com/s/t8doe35f3o2we2m/dogs.txt?dl=0
# big5 data 用 file - reopen with encoding - save with encoding
dogs <- read.table("../../../data/txt/dogs.txt", header = T, fileEncoding="utf8")
qplot(computerR, adoptedR, data=dogs)
lm.model <- lm(adoptedR~computerR, data=dogs, x =TRUE)
summary(lm.model)
#plot(computerR~adoptedR, data=dogs)
ggplot(dogs, aes(x = computerR, y = adoptedR)) +
geom_point() +
stat_smooth(method = "lm", col = "red")
anova(lm.model)
- 殘差分析 (Residuals analysis)
plot(lm.model)
resid <- lm.model$residuals
# using ggplot: https://rpubs.com/therimalaya/43190
# check normality
shapiro.test(resid)
- Correlation matrix 選取解釋變數
dogs1 <- dogs[-c(1,5)]
round(cor(dogs1),2); head(dogs1)
pairs(dogs1)
# or using ggplot2
#install.package('GGally')
require(GGally)
Loading required package: GGally
Registered S3 method overwritten by 'GGally':
method from
+.gg ggplot2
Attaching package: ‘GGally’
The following object is masked from ‘package:dplyr’:
nasa
ggpairs(dogs1[, c(2, 4:6)])
Error in ggpairs(dogs1[, c(2, 4:6)]) : 找不到物件 'dogs1'
比較紐約狗
The nycdogs
package contains three datasets, nyc_license, nyc_bites
, and nyc_zips
. They contain, respectively, data on all licensed dogs in New York city, data on reported dog bites in New York city, and geographical data for New York city at the zip code level.
- nycdogs is a data package, bundling several datasets into a convenient format.
#devtools::install_github("kjhealy/nycdogs")
library(tidyverse)
library(sf)
require(nycdogs)
Loading required package: nycdogs
To look at the tibble that contains the licensing data
nyc_license
- use the
nyc_zips
object to create a map of, for example, the prevalence of dog names by zip code:
nyc_coco <- nyc_license %>%
group_by(zip_code, animal_name) %>%
tally() %>%
mutate(freq = n / sum(n),
pct = round(freq*100, 2)) %>%
filter(animal_name == "Coco")
nyc_coco
coco_map <- left_join(nyc_zips, nyc_coco)
Joining, by = "zip_code"
## Map theme
theme_nymap <- function(base_size=9, base_family="") {
require(grid)
theme_bw(base_size=base_size, base_family=base_family) %+replace%
theme(axis.line=element_blank(),
axis.text=element_blank(),
axis.ticks=element_blank(),
axis.title=element_blank(),
panel.background=element_blank(),
panel.border=element_blank(),
panel.grid=element_blank(),
panel.spacing=unit(0, "lines"),
plot.background=element_blank(),
legend.justification = c(0,0),
legend.position = c(0.1, 0.6),
legend.direction = "horizontal"
)
}
coco_map %>% ggplot(mapping = aes(fill = pct)) +
geom_sf(color = "gray80", size = 0.1) +
scale_fill_viridis_c(option = "A") +
labs(fill = "Percent of Licensed Dogs") +
annotate(geom = "text", x = -74.145, y = 40.82,
label = "Where's Coco?", size = 6) +
theme_nymap() +
guides(fill = guide_legend(title.position = "top",
label.position = "bottom"))
Topic modeling
從文件中找出潛在「主題」的統計模型。
LDA 的[數學推導](原文:Blei, David M., Andrew Y. Ng, and Michael I. Jordan. 2003. “Latent Dirichlet Allocation.” The Journal of Machine Learning Research 3(1): 993-1022.)
先用 quanteda
的例子練習看看。
require(quanteda)
require(quanteda.corpora)
require(lubridate)
require(topicmodels)
corp_news <- download('data_corpus_guardian')
嘗試 URL 'https://www.dropbox.com/s/7mu92jzodpq11zc/data_corpus_guardian.rds?dl=1&v=1'
Content type 'application/binary' length 11347175 bytes (10.8 MB)
==================================================
downloaded 10.8 MB
corp_news_subset <- corpus_subset(corp_news, 'date' >= 2016)
ndoc(corp_news_subset)
[1] 6000
記得在文本處理中,採 non-position 的作法,要先把 text 表徵成 document-term matrix
(document-feature matrix
)。 where rows represent documents, and columns terms. Each cell is a count of how many times the term occurs in the document. Terms are typically words, but could be any n-gram of interest.同時也要(在一些預設之下)決定如何前處理你的資料。
- 以這個例子來說,前處理包括了移除功能詞、標點符號;找出顯著特徵詞作為 features the top 5% of the most frequent features (
min_termfreq = 0.95
) that appear in less than 10% of all documents (max_docfreq = 0.1
) using dfm_trim()
to focus on common but distinguishing features.
dfmat_news <- dfm(corp_news, remove_punct = TRUE, remove = stopwords('en')) %>%
dfm_remove(c('*-time', '*-timeUpdated', 'GMT', 'BST')) %>%
dfm_trim(min_termfreq = 0.95, termfreq_type = "quantile",
max_docfreq = 0.1, docfreq_type = "prop")
dfmat_news <- dfmat_news[ntoken(dfmat_news) > 0,]
- 施行主題模型 topic modeling (使用
topicmodel
套件中的 LDA
),預先決定主題數量 \(k\) (需要一點時間)
dtm <- convert(dfmat_news, to = "topicmodels")
lda <- LDA(dtm, k = 10)
- 建好模型之後,就可以看看這些主題使用了哪些重要詞彙
terms(lda, 10)
Topic 1 Topic 2 Topic 3 Topic 4 Topic 5
[1,] "hospital" "online" "nhs" "climate" "officers"
[2,] "violence" "game" "food" "china" "investigation"
[3,] "died" "games" "workers" "chinese" "justice"
[4,] "doctors" "music" "funding" "leadership" "inquiry"
[5,] "medical" "users" "housing" "paris" "allegations"
[6,] "child" "google" "income" "agreement" "trial"
[7,] "parents" "apple" "scheme" "development" "officer"
[8,] "woman" "internet" "customers" "elections" "charges"
[9,] "mental" "facebook" "cuts" "trade" "criminal"
[10,] "mother" "video" "data" "politics" "alleged"
Topic 6 Topic 7 Topic 8 Topic 9 Topic 10
[1,] "energy" "trump" "mps" "syria" "markets"
[2,] "water" "clinton" "australia" "military" "prices"
[3,] "climate" "republican" "referendum" "refugees" "shares"
[4,] "gas" "sanders" "australian" "isis" "oil"
[5,] "project" "obama" "corbyn" "attacks" "sales"
[6,] "land" "donald" "labor" "russian" "rate"
[7,] "environmental" "cruz" "tory" "syrian" "banks"
[8,] "green" "2016" "2016" "forces" "euro"
[9,] "coal" "hillary" "mp" "russia" "investors"
[10,] "food" "presidential" "osborne" "islamic" "quarter"
docvars(dfmat_news, 'topic') <- topics(lda)
head(topics(lda), 20)
text136751 text118588 text45146 text93623 text136585 text65682 text107174
4 2 10 8 8 5 3
text22792 text32425 text139163 text169133 text90312 text153451 text31104
5 9 3 1 6 5 3
text163885 text81309 text157885 text99128 text173244 text27905
6 10 2 9 6 9
應用
多觀摩一些專案,例如
Every year since 1947, representatives of UN member states gather at the annual sessions of the United Nations General Assembly. The centrepiece of each session is the General Debate. This is a forum at which leaders and other senior officials deliver statements that present their government’s perspective on the major issues in world politics. These statements are akin to the annual legislative state-of-the-union addresses in domestic politics. This new dataset, the UN General Debate Corpus (UNGDC), introduces the corpus of texts of General Debate statements from 1970 (Session 25) to 2016 (Session 71). Basic visualization and corpus exploration tools are available here.
完整從 http://shakespeare.mit.edu 抓檔過程在此。但若理解不需要再無故排放 CO2。資料在此。說明參見:https://m-clark.github.io/text-analysis-with-R/topic-modeling.html
可以參考 Julia Silge (text mining with R 作者) 最近做的 side project 福爾摩斯的故事
# `stm` require `glmnet`, which requieres R >= 3.6.0 so you would have to update R to be able to install it library(stm)
# check your R version with `version`, download latest R (3.6.2), install and restart Rstudio
#library(stm)
----------------------------------------
# Word embeddings
https://cbail.github.io/textasdata/word2vec/rmarkdown/word2vec.html
---
title: "Introduction to Modeling. week15"
output: html_notebook
---



# Regression

流浪汪星人 怎麼做 data package?

https://data.gov.tw/dataset/25602
98年度台灣地區各縣市流浪狗數調查結果?

https://animal.coa.gov.tw/html/index_06_0621_dog.html




- `computerR`: 各縣市平均每100個家庭的電腦數目
- `graduate`: 各縣市研究所畢業者人數


```{r}
require(dplyr)
require(ggplot2)
#https://www.dropbox.com/s/t8doe35f3o2we2m/dogs.txt?dl=0
# big5 data 用 file - reopen with encoding - save with encoding
dogs <- read.table("../../../data/txt/dogs.txt", header = T, fileEncoding="utf8")
qplot(computerR, adoptedR, data=dogs)

```



- 迴歸係數計算推論
```{r}
lm.model <- lm(adoptedR~computerR, data=dogs, x =TRUE)
summary(lm.model)

```


```{r}
#plot(computerR~adoptedR, data=dogs)
ggplot(dogs, aes(x = computerR, y = adoptedR)) + 
  geom_point() +
  stat_smooth(method = "lm", col = "red")
```



- 簡單線性迴歸的 ANOVA 表格
```{r}
anova(lm.model)
```


- 殘差分析 (Residuals analysis)

```{r}
plot(lm.model)
resid <- lm.model$residuals
# using ggplot: https://rpubs.com/therimalaya/43190
# check normality
shapiro.test(resid)

```



- Correlation matrix 選取解釋變數

```{r}
dogs1 <- dogs[-c(1,5)]
round(cor(dogs1),2); head(dogs1)
```



```{r}
pairs(dogs1)
# or using ggplot2 
```


```{r}
#install.package('GGally')
require(GGally)
ggpairs(dogs1[, c(2, 4:6)])

```

## 比較紐約狗


The `nycdogs` package contains three datasets, `nyc_license, nyc_bites`, and `nyc_zips`. They contain, respectively, data on all licensed dogs in New York city, data on reported dog bites in New York city, and geographical data for New York city at the zip code level.

- nycdogs is a data package, bundling several datasets into a convenient format.

```{r}
#devtools::install_github("kjhealy/nycdogs")
library(tidyverse)
library(sf)
require(nycdogs)
```


To look at the tibble that contains the licensing data

```{r}
nyc_license
```



- use the `nyc_zips` object to create a map of, for example, the prevalence of dog names by zip code:


```{r}
nyc_coco <- nyc_license %>%
    group_by(zip_code, animal_name) %>%
    tally() %>%
    mutate(freq = n / sum(n),
           pct = round(freq*100, 2)) %>%
    filter(animal_name == "Coco")

nyc_coco
```


```{r}
coco_map <- left_join(nyc_zips, nyc_coco)
## Map theme
theme_nymap <- function(base_size=9, base_family="") {
    require(grid)
    theme_bw(base_size=base_size, base_family=base_family) %+replace%
        theme(axis.line=element_blank(),
              axis.text=element_blank(),
              axis.ticks=element_blank(),
              axis.title=element_blank(),
              panel.background=element_blank(),
              panel.border=element_blank(),
              panel.grid=element_blank(),
              panel.spacing=unit(0, "lines"),
              plot.background=element_blank(),
              legend.justification = c(0,0),
              legend.position = c(0.1, 0.6), 
              legend.direction = "horizontal"
        )
}
```

- 可以作地圖


```{r}
coco_map %>% ggplot(mapping = aes(fill = pct)) +
    geom_sf(color = "gray80", size = 0.1) +
    scale_fill_viridis_c(option = "A") +
    labs(fill = "Percent of Licensed Dogs") +
  annotate(geom = "text", x = -74.145, y = 40.82, 
           label = "Where's Coco?", size = 6) + 
    theme_nymap() + 
    guides(fill = guide_legend(title.position = "top", 
                               label.position = "bottom"))
```


--------------------------------------

# Topic modeling 

- 從文件中找出潛在「主題」的統計模型。

- LDA 的[數學推導](原文：Blei, David M., Andrew Y. Ng, and Michael I. Jordan. 2003. "Latent Dirichlet Allocation." The Journal of Machine Learning Research 3(1): 993-1022.)

- 先用 `quanteda` 的例子練習看看。

```{r}
require(quanteda)
require(quanteda.corpora)
require(lubridate)
require(topicmodel)
```


- 下載英國衛報資料，挑選 2016 出版的新聞

```{r}
corp_news <- download('data_corpus_guardian')
corp_news_subset <- corpus_subset(corp_news, 'date' >= 2016)
ndoc(corp_news_subset)
```



> 記得在文本處理中，採 non-position 的作法，要先把 text 表徵成 `document-term matrix` (`document-feature matrix`)。 where rows represent documents, and columns terms. Each cell is a count of how many times the term occurs in the document. Terms are typically words, but could be any n-gram of interest.同時也要（在一些預設之下）決定如何前處理你的資料。


- 以這個例子來說，前處理包括了移除功能詞、標點符號；找出顯著特徵詞作為 **features**  the top 5% of the most frequent features (`min_termfreq = 0.95`) that appear in less than 10% of all documents (`max_docfreq = 0.1`) using `dfm_trim()` to focus on common but distinguishing features.

```{r}
dfmat_news <- dfm(corp_news, remove_punct = TRUE, remove = stopwords('en')) %>% 
    dfm_remove(c('*-time', '*-timeUpdated', 'GMT', 'BST')) %>% 
    dfm_trim(min_termfreq = 0.95, termfreq_type = "quantile", 
             max_docfreq = 0.1, docfreq_type = "prop")

dfmat_news <- dfmat_news[ntoken(dfmat_news) > 0,]

```


- 施行主題模型 topic modeling (使用 `topicmodel` 套件中的 `LDA`)，預先決定主題數量 $k$ （需要一點時間）

```{r}
dtm <- convert(dfmat_news, to = "topicmodels")
lda <- LDA(dtm, k = 10)
```


- 建好模型之後，就可以看看這些主題使用了哪些重要詞彙

```{r}
terms(lda, 10)
```

- 也可以儲存起來當成文件的後設資料

```{r}
docvars(dfmat_news, 'topic') <- topics(lda)
head(topics(lda), 20)

```

- 怎麼視覺化這些主題的分佈？
`LDAvis`(https://github.com/cpsievert/LDAvis/)

![](https://knowledger.rbind.io/img/LDAvis.png)


## 應用

多觀摩一些專案，例如

- [聯合國 General Debate Corpus](https://data.world/ian/united-nations-general-debate-corpus)。可以先找看看是否別人已經把資料處理好了，如 [kaggle 平台](https://www.kaggle.com/unitednations/un-general-debates#un-general-debates.csv)。用主題模型來看看不同年代，[各國代表都在談什麼「主題」。](http://ungd.smikhaylov.net/)
  - 延伸：台灣國會？不同縣市的議會質詢？


> Every year since 1947, representatives of UN member states gather at the annual sessions of the United Nations General Assembly. The centrepiece of each session is the General Debate. This is a forum at which leaders and other senior officials deliver statements that present their government’s perspective on the major issues in world politics. These statements are akin to the annual legislative state-of-the-union addresses in domestic politics. This new dataset, the UN General Debate Corpus (UNGDC), introduces the corpus of texts of General Debate statements from 1970 (Session 25) to 2016 (Session 71). Basic visualization and corpus exploration tools are available here.





- 用莎翁名劇玩看看

完整從 http://shakespeare.mit.edu 抓檔過程[在此](https://m-clark.github.io/text-analysis-with-R/shakespeare.html#shakespeare-start-to-finish)。但若理解不需要再無故排放 CO2。[資料在此](https://github.com/m-clark/text-analysis-with-R/blob/master/data/shakes_dtm_stemmed.RData)。說明參見：https://m-clark.github.io/text-analysis-with-R/topic-modeling.html


- 進階的主題模型: Structural topic model ([with `stm` package](https://www.structuraltopicmodel.com/)) 

可以參考 Julia Silge (text mining with R 作者) 最近做的 side project [福爾摩斯的故事](https://juliasilge.com/blog/sherlock-holmes-stm/)


```{r}
# `stm` require `glmnet`, which requieres R >= 3.6.0 so you would have to update R to be able to install it library(stm)
# check your R version with `version`, download latest R (3.6.2), install and restart Rstudio

#library(stm)



----------------------------------------

# Word embeddings

https://cbail.github.io/textasdata/word2vec/rmarkdown/word2vec.html
