Regression

流浪汪星人 怎麼做 data package?

https://data.gov.tw/dataset/25602 98年度台灣地區各縣市流浪狗數調查結果?

https://animal.coa.gov.tw/html/index_06_0621_dog.html

require(dplyr)
require(ggplot2)
#https://www.dropbox.com/s/t8doe35f3o2we2m/dogs.txt?dl=0
# big5 data 用 file - reopen with encoding - save with encoding
dogs <- read.table("../../../data/txt/dogs.txt", header = T, fileEncoding="utf8")
qplot(computerR, adoptedR, data=dogs)
lm.model <- lm(adoptedR~computerR, data=dogs, x =TRUE)
summary(lm.model)
#plot(computerR~adoptedR, data=dogs)
ggplot(dogs, aes(x = computerR, y = adoptedR)) + 
  geom_point() +
  stat_smooth(method = "lm", col = "red")
anova(lm.model)
plot(lm.model)
resid <- lm.model$residuals
# using ggplot: https://rpubs.com/therimalaya/43190
# check normality
shapiro.test(resid)
dogs1 <- dogs[-c(1,5)]
round(cor(dogs1),2); head(dogs1)
pairs(dogs1)
# or using ggplot2 
#install.package('GGally')
require(GGally)
Loading required package: GGally
Registered S3 method overwritten by 'GGally':
  method from   
  +.gg   ggplot2

Attaching package: ‘GGally’

The following object is masked from ‘package:dplyr’:

    nasa
ggpairs(dogs1[, c(2, 4:6)])
Error in ggpairs(dogs1[, c(2, 4:6)]) : 找不到物件 'dogs1'

比較紐約狗

The nycdogs package contains three datasets, nyc_license, nyc_bites, and nyc_zips. They contain, respectively, data on all licensed dogs in New York city, data on reported dog bites in New York city, and geographical data for New York city at the zip code level.

  • nycdogs is a data package, bundling several datasets into a convenient format.
#devtools::install_github("kjhealy/nycdogs")
library(tidyverse)
library(sf)
require(nycdogs)
Loading required package: nycdogs

To look at the tibble that contains the licensing data

nyc_license
  • use the nyc_zips object to create a map of, for example, the prevalence of dog names by zip code:
nyc_coco <- nyc_license %>%
    group_by(zip_code, animal_name) %>%
    tally() %>%
    mutate(freq = n / sum(n),
           pct = round(freq*100, 2)) %>%
    filter(animal_name == "Coco")

nyc_coco
coco_map <- left_join(nyc_zips, nyc_coco)
Joining, by = "zip_code"
## Map theme
theme_nymap <- function(base_size=9, base_family="") {
    require(grid)
    theme_bw(base_size=base_size, base_family=base_family) %+replace%
        theme(axis.line=element_blank(),
              axis.text=element_blank(),
              axis.ticks=element_blank(),
              axis.title=element_blank(),
              panel.background=element_blank(),
              panel.border=element_blank(),
              panel.grid=element_blank(),
              panel.spacing=unit(0, "lines"),
              plot.background=element_blank(),
              legend.justification = c(0,0),
              legend.position = c(0.1, 0.6), 
              legend.direction = "horizontal"
        )
}
  • 可以作地圖
coco_map %>% ggplot(mapping = aes(fill = pct)) +
    geom_sf(color = "gray80", size = 0.1) +
    scale_fill_viridis_c(option = "A") +
    labs(fill = "Percent of Licensed Dogs") +
  annotate(geom = "text", x = -74.145, y = 40.82, 
           label = "Where's Coco?", size = 6) + 
    theme_nymap() + 
    guides(fill = guide_legend(title.position = "top", 
                               label.position = "bottom"))


Topic modeling

require(quanteda)
require(quanteda.corpora)
require(lubridate)
require(topicmodels)
corp_news <- download('data_corpus_guardian')
嘗試 URL 'https://www.dropbox.com/s/7mu92jzodpq11zc/data_corpus_guardian.rds?dl=1&v=1'
Content type 'application/binary' length 11347175 bytes (10.8 MB)
==================================================
downloaded 10.8 MB
corp_news_subset <- corpus_subset(corp_news, 'date' >= 2016)
ndoc(corp_news_subset)
[1] 6000

記得在文本處理中,採 non-position 的作法,要先把 text 表徵成 document-term matrix (document-feature matrix)。 where rows represent documents, and columns terms. Each cell is a count of how many times the term occurs in the document. Terms are typically words, but could be any n-gram of interest.同時也要(在一些預設之下)決定如何前處理你的資料。

dfmat_news <- dfm(corp_news, remove_punct = TRUE, remove = stopwords('en')) %>% 
    dfm_remove(c('*-time', '*-timeUpdated', 'GMT', 'BST')) %>% 
    dfm_trim(min_termfreq = 0.95, termfreq_type = "quantile", 
             max_docfreq = 0.1, docfreq_type = "prop")

dfmat_news <- dfmat_news[ntoken(dfmat_news) > 0,]
dtm <- convert(dfmat_news, to = "topicmodels")
lda <- LDA(dtm, k = 10)
terms(lda, 10)
      Topic 1    Topic 2    Topic 3     Topic 4       Topic 5        
 [1,] "hospital" "online"   "nhs"       "climate"     "officers"     
 [2,] "violence" "game"     "food"      "china"       "investigation"
 [3,] "died"     "games"    "workers"   "chinese"     "justice"      
 [4,] "doctors"  "music"    "funding"   "leadership"  "inquiry"      
 [5,] "medical"  "users"    "housing"   "paris"       "allegations"  
 [6,] "child"    "google"   "income"    "agreement"   "trial"        
 [7,] "parents"  "apple"    "scheme"    "development" "officer"      
 [8,] "woman"    "internet" "customers" "elections"   "charges"      
 [9,] "mental"   "facebook" "cuts"      "trade"       "criminal"     
[10,] "mother"   "video"    "data"      "politics"    "alleged"      
      Topic 6         Topic 7        Topic 8      Topic 9    Topic 10   
 [1,] "energy"        "trump"        "mps"        "syria"    "markets"  
 [2,] "water"         "clinton"      "australia"  "military" "prices"   
 [3,] "climate"       "republican"   "referendum" "refugees" "shares"   
 [4,] "gas"           "sanders"      "australian" "isis"     "oil"      
 [5,] "project"       "obama"        "corbyn"     "attacks"  "sales"    
 [6,] "land"          "donald"       "labor"      "russian"  "rate"     
 [7,] "environmental" "cruz"         "tory"       "syrian"   "banks"    
 [8,] "green"         "2016"         "2016"       "forces"   "euro"     
 [9,] "coal"          "hillary"      "mp"         "russia"   "investors"
[10,] "food"          "presidential" "osborne"    "islamic"  "quarter"  
docvars(dfmat_news, 'topic') <- topics(lda)
head(topics(lda), 20)
text136751 text118588  text45146  text93623 text136585  text65682 text107174 
         4          2         10          8          8          5          3 
 text22792  text32425 text139163 text169133  text90312 text153451  text31104 
         5          9          3          1          6          5          3 
text163885  text81309 text157885  text99128 text173244  text27905 
         6         10          2          9          6          9 

應用

多觀摩一些專案,例如

Every year since 1947, representatives of UN member states gather at the annual sessions of the United Nations General Assembly. The centrepiece of each session is the General Debate. This is a forum at which leaders and other senior officials deliver statements that present their government’s perspective on the major issues in world politics. These statements are akin to the annual legislative state-of-the-union addresses in domestic politics. This new dataset, the UN General Debate Corpus (UNGDC), introduces the corpus of texts of General Debate statements from 1970 (Session 25) to 2016 (Session 71). Basic visualization and corpus exploration tools are available here.

  • 用莎翁名劇玩看看

完整從 http://shakespeare.mit.edu 抓檔過程在此。但若理解不需要再無故排放 CO2。資料在此。說明參見:https://m-clark.github.io/text-analysis-with-R/topic-modeling.html

可以參考 Julia Silge (text mining with R 作者) 最近做的 side project 福爾摩斯的故事

# `stm` require `glmnet`, which requieres R >= 3.6.0 so you would have to update R to be able to install it library(stm)
# check your R version with `version`, download latest R (3.6.2), install and restart Rstudio

#library(stm)



----------------------------------------

# Word embeddings

https://cbail.github.io/textasdata/word2vec/rmarkdown/word2vec.html
---
title: "Introduction to Modeling. week15"
output: html_notebook
---



# Regression

流浪汪星人 怎麼做 data package?

https://data.gov.tw/dataset/25602
98年度台灣地區各縣市流浪狗數調查結果?

https://animal.coa.gov.tw/html/index_06_0621_dog.html




- `computerR`: 各縣市平均每100個家庭的電腦數目
- `graduate`: 各縣市研究所畢業者人數


```{r}
require(dplyr)
require(ggplot2)
#https://www.dropbox.com/s/t8doe35f3o2we2m/dogs.txt?dl=0
# big5 data 用 file - reopen with encoding - save with encoding
dogs <- read.table("../../../data/txt/dogs.txt", header = T, fileEncoding="utf8")
qplot(computerR, adoptedR, data=dogs)

```



- 迴歸係數計算推論
```{r}
lm.model <- lm(adoptedR~computerR, data=dogs, x =TRUE)
summary(lm.model)

```


```{r}
#plot(computerR~adoptedR, data=dogs)
ggplot(dogs, aes(x = computerR, y = adoptedR)) + 
  geom_point() +
  stat_smooth(method = "lm", col = "red")
```



- 簡單線性迴歸的 ANOVA 表格
```{r}
anova(lm.model)
```


- 殘差分析 (Residuals analysis)

```{r}
plot(lm.model)
resid <- lm.model$residuals
# using ggplot: https://rpubs.com/therimalaya/43190
# check normality
shapiro.test(resid)

```



- Correlation matrix 選取解釋變數

```{r}
dogs1 <- dogs[-c(1,5)]
round(cor(dogs1),2); head(dogs1)
```



```{r}
pairs(dogs1)
# or using ggplot2 
```


```{r}
#install.package('GGally')
require(GGally)
ggpairs(dogs1[, c(2, 4:6)])

```

## 比較紐約狗


The `nycdogs` package contains three datasets, `nyc_license, nyc_bites`, and `nyc_zips`. They contain, respectively, data on all licensed dogs in New York city, data on reported dog bites in New York city, and geographical data for New York city at the zip code level.

- nycdogs is a data package, bundling several datasets into a convenient format.

```{r}
#devtools::install_github("kjhealy/nycdogs")
library(tidyverse)
library(sf)
require(nycdogs)
```


To look at the tibble that contains the licensing data

```{r}
nyc_license
```



- use the `nyc_zips` object to create a map of, for example, the prevalence of dog names by zip code:


```{r}
nyc_coco <- nyc_license %>%
    group_by(zip_code, animal_name) %>%
    tally() %>%
    mutate(freq = n / sum(n),
           pct = round(freq*100, 2)) %>%
    filter(animal_name == "Coco")

nyc_coco
```


```{r}
coco_map <- left_join(nyc_zips, nyc_coco)
## Map theme
theme_nymap <- function(base_size=9, base_family="") {
    require(grid)
    theme_bw(base_size=base_size, base_family=base_family) %+replace%
        theme(axis.line=element_blank(),
              axis.text=element_blank(),
              axis.ticks=element_blank(),
              axis.title=element_blank(),
              panel.background=element_blank(),
              panel.border=element_blank(),
              panel.grid=element_blank(),
              panel.spacing=unit(0, "lines"),
              plot.background=element_blank(),
              legend.justification = c(0,0),
              legend.position = c(0.1, 0.6), 
              legend.direction = "horizontal"
        )
}
```

- 可以作地圖


```{r}
coco_map %>% ggplot(mapping = aes(fill = pct)) +
    geom_sf(color = "gray80", size = 0.1) +
    scale_fill_viridis_c(option = "A") +
    labs(fill = "Percent of Licensed Dogs") +
  annotate(geom = "text", x = -74.145, y = 40.82, 
           label = "Where's Coco?", size = 6) + 
    theme_nymap() + 
    guides(fill = guide_legend(title.position = "top", 
                               label.position = "bottom"))
```


--------------------------------------

# Topic modeling 

- 從文件中找出潛在「主題」的統計模型。

- LDA 的[數學推導](原文：Blei, David M., Andrew Y. Ng, and Michael I. Jordan. 2003. "Latent Dirichlet Allocation." The Journal of Machine Learning Research 3(1): 993-1022.)

- 先用 `quanteda` 的例子練習看看。

```{r}
require(quanteda)
require(quanteda.corpora)
require(lubridate)
require(topicmodel)
```


- 下載英國衛報資料，挑選 2016 出版的新聞

```{r}
corp_news <- download('data_corpus_guardian')
corp_news_subset <- corpus_subset(corp_news, 'date' >= 2016)
ndoc(corp_news_subset)
```



> 記得在文本處理中，採 non-position 的作法，要先把 text 表徵成 `document-term matrix` (`document-feature matrix`)。 where rows represent documents, and columns terms. Each cell is a count of how many times the term occurs in the document. Terms are typically words, but could be any n-gram of interest.同時也要（在一些預設之下）決定如何前處理你的資料。


- 以這個例子來說，前處理包括了移除功能詞、標點符號；找出顯著特徵詞作為 **features**  the top 5% of the most frequent features (`min_termfreq = 0.95`) that appear in less than 10% of all documents (`max_docfreq = 0.1`) using `dfm_trim()` to focus on common but distinguishing features.

```{r}
dfmat_news <- dfm(corp_news, remove_punct = TRUE, remove = stopwords('en')) %>% 
    dfm_remove(c('*-time', '*-timeUpdated', 'GMT', 'BST')) %>% 
    dfm_trim(min_termfreq = 0.95, termfreq_type = "quantile", 
             max_docfreq = 0.1, docfreq_type = "prop")

dfmat_news <- dfmat_news[ntoken(dfmat_news) > 0,]

```


- 施行主題模型 topic modeling (使用 `topicmodel` 套件中的 `LDA`)，預先決定主題數量 $k$ （需要一點時間）

```{r}
dtm <- convert(dfmat_news, to = "topicmodels")
lda <- LDA(dtm, k = 10)
```


- 建好模型之後，就可以看看這些主題使用了哪些重要詞彙

```{r}
terms(lda, 10)
```

- 也可以儲存起來當成文件的後設資料

```{r}
docvars(dfmat_news, 'topic') <- topics(lda)
head(topics(lda), 20)

```

- 怎麼視覺化這些主題的分佈？
`LDAvis`(https://github.com/cpsievert/LDAvis/)

![](https://knowledger.rbind.io/img/LDAvis.png)


## 應用

多觀摩一些專案，例如

- [聯合國 General Debate Corpus](https://data.world/ian/united-nations-general-debate-corpus)。可以先找看看是否別人已經把資料處理好了，如 [kaggle 平台](https://www.kaggle.com/unitednations/un-general-debates#un-general-debates.csv)。用主題模型來看看不同年代，[各國代表都在談什麼「主題」。](http://ungd.smikhaylov.net/)
  - 延伸：台灣國會？不同縣市的議會質詢？


> Every year since 1947, representatives of UN member states gather at the annual sessions of the United Nations General Assembly. The centrepiece of each session is the General Debate. This is a forum at which leaders and other senior officials deliver statements that present their government’s perspective on the major issues in world politics. These statements are akin to the annual legislative state-of-the-union addresses in domestic politics. This new dataset, the UN General Debate Corpus (UNGDC), introduces the corpus of texts of General Debate statements from 1970 (Session 25) to 2016 (Session 71). Basic visualization and corpus exploration tools are available here.





- 用莎翁名劇玩看看

完整從 http://shakespeare.mit.edu 抓檔過程[在此](https://m-clark.github.io/text-analysis-with-R/shakespeare.html#shakespeare-start-to-finish)。但若理解不需要再無故排放 CO2。[資料在此](https://github.com/m-clark/text-analysis-with-R/blob/master/data/shakes_dtm_stemmed.RData)。說明參見：https://m-clark.github.io/text-analysis-with-R/topic-modeling.html


- 進階的主題模型: Structural topic model ([with `stm` package](https://www.structuraltopicmodel.com/)) 

可以參考 Julia Silge (text mining with R 作者) 最近做的 side project [福爾摩斯的故事](https://juliasilge.com/blog/sherlock-holmes-stm/)


```{r}
# `stm` require `glmnet`, which requieres R >= 3.6.0 so you would have to update R to be able to install it library(stm)
# check your R version with `version`, download latest R (3.6.2), install and restart Rstudio

#library(stm)



----------------------------------------

# Word embeddings

https://cbail.github.io/textasdata/word2vec/rmarkdown/word2vec.html
