- Introduction
- Text analytics as the Core
- Sense and Sentiment
- Conclusion
DASH: A new graduate certificate: data management, statistics, text analysis, geospatial analysis, digital prosopography, and data visualization and information design.
Natural language is primarily hard because it is messy. There are few rules. And yet we can easily understand each other most of the time.
We are awash with text, from books, papers, blogs, tweets, news, and increasingly text from spoken utterances.
Human language is highly ambiguous … It is also ever changing and evolving. People are great at producing language and understanding language, and are capable of expressing, perceiving, and interpreting very elaborate and nuanced meanings. At the same time, while we humans are great users of language, we are also very poor at formally understanding and describing the rules that govern language. (Neural Network Methods in Natural Language Processing, 2017. http://amzn.to/2u0JtPl)
    library(jiebaR)
    # Initialize the segmentation engine with worker()
    seg <- worker()
    # See ?worker for the available settings
    # A simple segmentation example
    seg["據台大語言所小編謝舒凱表示,宅宅也是非常用功 der"]

    ## [1] "據" "台大" "語言所" "小編" "謝舒凱" "表示" "宅宅"
    ## [8] "也" "是" "非常" "用功" "der"

    seg["你這種人還是死死去最好了"]

    ## [1] "你" "這種" "人" "還是" "死" "死去" "最好" "了"
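The split 死 / 死去 in the second example is debatable, and it hints at why dictionary-driven segmentation is hard: a segmenter can commit to a plausible word and leave the wrong remainder behind. The sketch below reproduces that kind of failure on an unspaced English string with a toy forward maximum matching (greedy longest-match) segmenter. It is not jiebaR's algorithm; `fmm_segment` and the small `lexicon` are hypothetical names introduced only for this illustration.

```r
# Toy forward maximum matching (greedy longest-match) segmenter.
# Hypothetical helper for illustration; this is NOT how jiebaR segments text.
fmm_segment <- function(text, lexicon, max_len = max(nchar(lexicon))) {
  tokens <- character(0)
  pos <- 1
  n <- nchar(text)
  while (pos <= n) {
    matched <- FALSE
    # Try the longest candidate first, then shrink one character at a time.
    for (len in seq(min(max_len, n - pos + 1), 1)) {
      cand <- substr(text, pos, pos + len - 1)
      if (cand %in% lexicon) {
        tokens <- c(tokens, cand)
        pos <- pos + len
        matched <- TRUE
        break
      }
    }
    # Nothing in the lexicon matched: emit a single character and move on.
    if (!matched) {
      tokens <- c(tokens, substr(text, pos, pos))
      pos <- pos + 1
    }
  }
  tokens
}

# Made-up mini lexicon; the intended reading is "the youth event".
lexicon <- c("the", "they", "you", "youth", "out", "he", "event", "vent")
result <- fmm_segment("theyouthevent", lexicon)
print(result)
```

Because the greedy strategy prefers "they" over "the", every later boundary shifts and the output is "they" "out" "he" "vent". Practical segmenters reduce such errors with frequency information or statistical models rather than pure longest match.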
German compound word: Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz
(“the law concerning the delegation of duties for the supervision of cattle marking and the labelling of beef”)
Turkish: OSMANLILAŞTIRAMAYABİLECEKLERİMİZDENMİŞSİNİZCESİNE
(“as if you were of those whom we might consider not converting into an Ottoman”)
    library(quanteda)
    # txt_jp <- "政治とは社会に対して全体的な影響を及ぼし、
    # 社会で生きるひとりひとりの人の人生にも様々な影響を及ぼす複雑な領域である。"
    # tokens(txt_jp)
    txt_khmer <- "តៃវ៉ាន់បោះជំហានឆ្ពោះទៅរកការធ្វើពាណិជ្ជកម្មនៅអាស៊ីដើម្បីកាត់បន្ថយភាពអាស្រ័យលើប្រទេសចិន "
    # English: Taiwan Steps up Asia Business to Reduce Dependence on China
    # Romanization: taivean baohchomhan chhpaohtow rokkarothveu peanechchokamm now asai daembi
    # katbanthoy pheap asry leu bratesa chen
    tokens(txt_khmer)

    ## tokens from 1 document.
    ## text1 :
    ## [1] "តៃវ៉ាន់" "បោះជំហាន" "ឆ្ពោះទៅ" "រកការធ្វើ" "ពាណិជ្ជកម្ម" "នៅ" "អាស៊ី"
    ## [8] "ដើម្បី" "កាត់បន្ថយ" "ភាព" "អាស្រ័យ" "លើ" "ប្រទេស" "ចិន"
Emotions are at the heart of what it means to be human… we act in the world either volitionally or emotionally. (Mason, 2015)
Affect is a broader, all-encompassing term which refers to the general topics of emotion, feelings, and mood together.
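Mood (心情), temperament (氣質), temper (脾氣), disposition (性情), character (性格), emotion (情緒), attitude (態度), personality (人格), conduct (品行).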
- Lexical behavior and knowledge representation from different perspectives
Operationalized lexical knowledge representation
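E.g., we care not (only) about which senses 打 (dǎ, "to hit") has, but more about under what criteria or context the senses of 打 are divided, and into how many; and about how such annotation information (categorical and/or numerical) relates to the behavior of the same unit observed in other contexts (acquisition, emotion, development, language teaching, neural representation, psychological response, etc.).
Starting from this second-level knowledge, we supply "linguistic nutrients" to NLP/machine learning (rather than shallow linguistic feature engineering).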
Reused, Reproduced, Reshaped and Reinforced.
It takes the functional position (usage-based view) in determining units and patterns (in Chinese), as well as the ontological grounding on the relation between linguistic objects and situations (bits of reality). (Langacker 1987, 1988, 1999; Croft 2002; Tomasello 2003; Bybee 2006, 2010)
Lexical data at different levels are modularized (only for practical reasons) into, for example, a syntax-semantics module, an emotion module, a discourse and pragmatics module, a diachronic module, etc. Researchers from different fields can initiate new collaborations based on these modules.
Hanzi | Semantics | Emotion | Lexical.Age | Acquisition | Social Network | Morpho-syntax | ... |
---|---|---|---|---|---|---|---|
phonetics | sense | polarity | 1930.freq | 3y.freq | indegree | POS | ... |
components | relations | classes | 1940.freq | 4y.freq | outdegree | productivity | ... |
... | ... | ... | ... | ... | ... | ... | ... |
At the moment there are 55k units (ranging from characters to lexical chunks) with more than 150 variables. The scope and size are still evolving; with concerted, long-term effort, we believe this resource will be valuable for deep natural language processing and intelligent applications.
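One way to picture the modular design is as separate tables keyed by lexical unit that are joined only when a study needs them. The following is a minimal sketch under that assumption, not the resource's actual storage format. The unit IDs, the values, and the column name `n_senses` are hypothetical placeholders; `polarity`, `indegree`, and `outdegree` echo the column labels in the table above.

```r
# Each module is its own data frame keyed by `unit`.
# All IDs and values are made-up placeholders, not data from the resource.
semantics <- data.frame(
  unit = c("unit_a", "unit_b", "unit_c"),
  n_senses = c(3L, 1L, 2L),
  stringsAsFactors = FALSE
)
emotion <- data.frame(
  unit = c("unit_a", "unit_c"),  # unit_b has no emotion annotation yet
  polarity = c(-1, 1),
  stringsAsFactors = FALSE
)
network <- data.frame(
  unit = c("unit_a", "unit_b", "unit_c"),
  indegree = c(5L, 0L, 2L),
  outdegree = c(1L, 4L, 2L),
  stringsAsFactors = FALSE
)

# Join the modules on `unit`; all = TRUE keeps units that a module lacks.
modules <- list(semantics, emotion, network)
lexicon <- Reduce(function(x, y) merge(x, y, by = "unit", all = TRUE), modules)
print(lexicon)
```

Keeping the modules separate means a new annotation layer can be added as one more table without touching the existing ones; units missing from a module simply show up as `NA` after the join.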
Text analytics, corpus, machine learning, natural language processing
Text analytics involves lexical analysis, categorization, clustering, pattern recognition, tagging, annotation, information extraction, link and association analysis, visualization, and predictive analytics. It extracts keywords, topics, categories, semantics, and tags from the millions of texts an organization holds in different files and formats. The term text analytics is roughly synonymous with text mining.
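Most of these steps start from counts of words per document. As a minimal sketch in base R, the code below tokenizes two short texts and builds a small document-term matrix. The helper `tokenize` and the two example documents are assumptions made for illustration; packages such as quanteda or tm provide far more complete versions of the same idea.

```r
# Two tiny example documents (illustrative only).
docs <- c(
  d1 = "Text analytics is roughly synonymous with text mining.",
  d2 = "Good annotations support good applications."
)

# Hypothetical helper: lowercase, split on anything that is not a-z,
# and drop the empty strings left by the split.
tokenize <- function(x) {
  toks <- strsplit(tolower(x), "[^a-z]+")[[1]]
  toks[nzchar(toks)]
}

tokens <- lapply(docs, tokenize)
vocab <- sort(unique(unlist(tokens)))

# Count each vocabulary item in each document.
counts <- vapply(
  tokens,
  function(tk) as.integer(table(factor(tk, levels = vocab))),
  integer(length(vocab))
)

# Rows are documents, columns are terms.
dtm <- t(counts)
colnames(dtm) <- vocab
print(dtm)
```

Keyword extraction, clustering, and classification can all be run on a matrix of this shape, usually after weighting or filtering the raw counts.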
Exploratory data analysis and Infographics (data visualization for the purpose of discovery. We look for groups in data, find outliers, identify common dimensions, patterns, and trends.)
Prediction models (Regression; Classification and Clustering;) and Evaluations (Recommender systems, collaborative filtering, association rules, optimization methods based on linguistic heuristics, as well as a myriad of methods for regression, classification, and clustering fall under the rubric of machine learning).
How is this possible?! Linguistics, language resources, and natural language processing!
Good annotations support good applications
likelihood ratio measures), etc. (Benoit, 2016)
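A common likelihood ratio measure for word association or keyness is the log-likelihood statistic G² over a 2×2 contingency table, G² = 2 Σ O ln(O / E), where O are observed counts and E are the counts expected under independence. The sketch below assumes this is the kind of measure meant; the function name `g2` and the counts are hypothetical.

```r
# Log-likelihood ratio (G^2) for a 2x2 contingency table.
# a, b = counts in the first row; c, d = counts in the second row.
# Hypothetical helper for illustration.
g2 <- function(a, b, c, d) {
  obs <- matrix(c(a, b, c, d), nrow = 2, byrow = TRUE)
  # Expected counts under independence: row total * column total / grand total.
  expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)
  # Cells with zero observations contribute nothing to the sum.
  terms <- ifelse(obs > 0, obs * log(obs / expected), 0)
  2 * sum(terms)
}

# Made-up counts: a word seen 30 times in a target corpus of 40 tokens
# versus 10 times in a reference corpus of 40 tokens.
score <- g2(30, 10, 10, 30)
print(score)
```

A value near zero means the word is distributed about as independence predicts; larger values indicate a stronger association with one of the two corpora.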
(word2vec)
word2vec, tmcn.word2vec, text2vec
(Hamilton et al., 2016)
Inter-coder agreement, reliability (信度), validity (效度)
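Inter-coder agreement is often summarized with Cohen's kappa, which corrects the raw agreement rate for the agreement expected by chance: κ = (p_o − p_e) / (1 − p_e). A minimal base R sketch, assuming two coders who labeled the same items; the function name `cohen_kappa` and the labels are made up for illustration.

```r
# Cohen's kappa for two coders who labeled the same items.
# Hypothetical helper for illustration.
cohen_kappa <- function(coder1, coder2) {
  stopifnot(length(coder1) == length(coder2))
  # Use a shared set of levels so the confusion table is square.
  lv <- sort(unique(c(coder1, coder2)))
  tab <- table(factor(coder1, levels = lv), factor(coder2, levels = lv))
  n <- sum(tab)
  p_obs <- sum(diag(tab)) / n                      # observed agreement
  p_exp <- sum(rowSums(tab) * colSums(tab)) / n^2  # chance agreement
  (p_obs - p_exp) / (1 - p_exp)
}

# Made-up sentiment labels from two annotators for eight items.
coder1 <- c("pos", "pos", "neg", "neg", "neu", "pos", "neg", "neu")
coder2 <- c("pos", "neg", "neg", "neg", "neu", "pos", "neg", "pos")
kappa <- cohen_kappa(coder1, coder2)
print(kappa)
```

Here the coders agree on 6 of 8 items, but kappa is lower than 0.75 because some of that agreement would be expected by chance given how often each label is used.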
Pros
Cons
(gold standard)
GATE
CAT (Coding Analysis Toolkit)
tm, tmcn, quanteda, preText, text2vec, syuzhet, sentimentr, tidytext
A Bright Future for AHSS (Arts, Humanities and Social Sciences)?