class: title-slide .bg-text[ # Introduction to Data Science with R ### week.3 <hr /> 9月 26, 2019 謝舒凱 ] --- # 課程資訊 - [課程網頁課綱已更新](https://rlads2019.github.io/lecture/) - 國慶連假後記名不計分小考預告(記名不計分) --- # DataCamp 閱讀與練習作業 > 20% 同學免試 --- ## 工具提醒 不一定要用 RStudio (e.g. 指令列 `Rscript test.R`), 但它還可以做很多事 (to be continued...) - 做 ([可重製 reproducible](http://rmarkdown.rstudio.com/lesson-1.html)、[可調參數 Parameterized](http://rmarkdown.rstudio.com/developer_parameterized_reports.html)、[互動型 interactive](http://rmarkdown.rstudio.com/lesson-14.html)) 筆記(notebook) 與報告 (report) - 做投影片 (presentation) - 做網站 (website) 與 web application (using `shiny`) - 做「數位報表」(dashboard) - 做專業科學文件 (using `\(\LaTeX\)`) --- background-image: url(../img/emo/boredom-small.png) --- ## 各取強項 <img src = https://i2.wp.com/www.business-science.io/assets/2018-10-08-python-and-r/python_r_workflow.png?zoom=2&w=456 scale="50%"></img> --- ## Learning DS with R `對應式`學習意識 **R syntax** `\(< >\)` 【獲取】Obtaining data, 【整理】Scrubbing data, 【探索】Exploring data, 【建模】Modeling data, 【詮解】Interpreting (and reporting) data. --- ## R 的初體驗 ```r data() # browse pre-loaded data sets data(rivers) # get this one: "Lengths of Major North American Rivers" ?rivers head(rivers,10) # peek at the data set ``` ``` ## [1] 735 320 325 392 524 450 1459 135 465 600 ``` ```r length(rivers) # how many rivers were measured? ``` ``` ## [1] 141 ``` ```r summary(rivers) # what are some summary statistics? ``` ``` ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 135.0 310.0 425.0 591.2 680.0 3710.0 ``` --- ## R 的初體驗 ```r # make a histogram; play around with these parameters hist(rivers, col="blue", border="white", breaks=25) ``` ![](index_files/figure-html/unnamed-chunk-2-1.png)<!-- --> --- ## In-class Exercise.1 - 換看看顏色 (用 `colors()` 看 R 認識什麼顏色) ```r hist(log(rivers), col="sienna", border="white", breaks=25) ``` ![](index_files/figure-html/unnamed-chunk-3-1.png)<!-- --> --- ## Base R and Package An R package contains `data sets` and specific `functions` to solve specific question. - 安裝發佈在 CRAN 上的套件: `install.packages("ggplot2")` - 安裝在 Github 上的開發套件: ```r #install.packages("devtools") #devtools::install_github("shukai/coolR) ``` --- ## 比較圖形套件 `plot()` and others ```r library("ggplot2") # qplot: ggplot2 中最基本的繪圖函數 qplot(data = iris, Sepal.Length, Petal.Length, color = Species, size = Petal.Width) ``` ![](index_files/figure-html/unnamed-chunk-5-1.png)<!-- --> --- ## 入門小秘訣 - `ctrl + l` 清除 console 的顯示內容。 - `rm(list=ls())` 清除 workspace 中的變數。 - 但請注意:R 也可以在終端機執行:對於日後在雲端伺服器工作者,特別是結合**指令列 (command line)** 很重要。 - 隨時知道妳在那裡:`getwd()` and `Set Working Directory` --- ## 變數 (variable)、賦值 (assignment) - R 在給予變數值時是利用`<-` 而不是其他程式語言中常見的 `=`。(根據 R 官方文件解釋因為在某些狀況是會出問題)。 - 變數命名中,大小寫有所區別。所以 a 與 A 是不同的變數。 ```r a <- 19 a ``` ``` ## [1] 19 ``` --- ## Modes and classes of R objects - 變數命名規則舉例:cannot start with numbers; it will start with a character or underscore; no special character allowed, such as @, #, $, and *. - 存入變數後,它就是個物件 (object)。兩種最重要的物件屬性 (attribute) 是 `class` 與 `mode` (*numeric, character*, *logical*, *function*). - The `mode()` returns the mode of R objects. 表示物件在記憶體中是何種類型存儲的;類別概念以後再談。 ```r mode(rivers) ``` ``` ## [1] "numeric" ``` --- ## 資料類型 (Data type) 與基本運算 (basic arithmetic) 資料類型包含以下幾種,可用 `mode` 函數判斷 - **數值型 (numeric)**:實數(可以寫成整數 integers,小數 floating numners,或 科學記述 scientific notations) ```r b <- 8.31 mode(b) ``` ``` ## [1] "numeric" ``` - **字符型 (character)**:文字字串,放入 "" 或 '' 中 ```r c <- 'coding' mode(c) ``` ``` ## [1] "character" ``` --- ## 資料類型 (Data type) 與基本運算 (basic arithmetic) - **邏輯型 (logical)**:`TRUE`(T) 和 `FALSE`(F) 兩個值 ```r d <- F mode(d) ``` ``` ## [1] "logical" ``` - **複數型 (complex)** :取值包含虛數 `\(a+bi\)` ```r e <- 2+3i mode(e) ``` ``` ## [1] "complex" ``` --- ## NA and NULL - NA (*missing*) - NULL (*undefined*) --- ## 資料類型強制轉換 (type coercion): - If an R object contains both numeric and logical elements, the mode of that object will be numeric and in that case the logical element automatically gets converted to numeric. - if any R object contains a character element along with both numeric and logical elements, it automatically converts to the character mode. ```r # R object containing both numeric and logical element x <- c(2, 4, TRUE, 6, FALSE, 8); mode(x) ``` ``` ## [1] "numeric" ``` ```r # R object containing character, numeric, and logical elements y <- c(1,2,TRUE,FALSE,"a"); mode(y) ``` ``` ## [1] "character" ``` ??? 趕作業 --- ## 資料類型的判斷與轉換 | 類型 | 意義 | 判斷 | 轉換 | |-----------|---------|--------------|--------------| | numeric | 數值 | is.numeric() | as.numeric() | | character | 字符 | is.character() | as.character() | | logical | 邏輯 | is.logical() | as.logical() | | complex | 複數 | is.complex() | as.complex() | | NA | 缺失 | is.na() | as.na() | ```r is.character(b) ``` ``` ## [1] FALSE ``` ```r as.character(b) ``` ``` ## [1] "8.31" ``` --- ## 資料結構 Data structure - 一組(2 個以上)相同或不同**資料類型**的資料元素組合在一起形成**資料結構**. - R 提供 6 個基本的資料結構:`vector`, `matrix`, `array`, `factor`, `list`, `data frame`. - 學習重點在於**如何建立 create 與檢索 access** --- ## 向量 Vector a combination of multiple values (`numeric, character` or `logical`) ### 建立 - `c()` ('concatenate') - `:` 可產生差距為 1 的等差數列向量。 - `seq()` 可產生等差數列向量,差距值可以自行決定。 - `rep()` 可產生重複數值的向量。 ```r g <- c(1,2,3) h <- c('me','you') i <- 1:6 j <- seq(from=1, to=10, by=2) k <- rep(1:4, times=3, each=2) ``` --- ## 向量 Vector ### 檢索 access - Get a subset of a vector: `my_vec[i]` to get the `ith` elment. - Calculations with vectors: `max(), min(), length(), sum(), mean(),sd(),var()`, etc. ```r m <- c(2:10) m[1] ``` ``` ## [1] 2 ``` ```r m[1:3] ``` ``` ## [1] 2 3 4 ``` --- ## 課堂練習.2 Preparing/Obtaining Data - 資料格式 - Comma separated values (`*.csv`) - Text file with Tab delimited (`*.txt` or `*.tbl`) - MS Excel file (`*.xls` or `*.xlsx`) - R data object (`*.RData`) - 資料來源 - Web (下載;網路爬蟲 Scraping and parsing data from the **web** (raw HTML sources); Interacting with APIs) - 資料庫 database --- ## 課堂練習.2 collaboration: the baby step - go to [shared doc](https://docs.google.com/spreadsheets/d/1DbifnNulA2ReMjaAhWZkdx72QDWP2WXDw8th5mRXgRA/edit?usp=sharing) - type in your data - download it as `csv`, and read the file into R - quick look at the data and do some preliminary analysis (in groups) --- ## 基本備檔 Data preparation rule - use the first row as **column names** (which represent *variables*). - Use the first column as **row names** (which represent *observations*). - Avoid names with blank spaces. Good example:`person_look`, or `person.look`. Bad example: `person look` - Avoid names with special symbols (excpet `_`) - Avoid beggining variable names with a number. Good: `run_100m`, Bad: `100m`. - R is case sensitive; and row/column name should be unique. - Avoid blank rows in your data; delete any comments in your file. - Replace missing values by **NA**. - Use the four digit format for data. Good: `01/06/1970`, Bad: `01/06/70`. --- ## 檔案欄位說明 specification 連結放在本週 sli.do -【gender】: 0(trans)-1(femail)-2(male) -【grade】: 1-2-3-4-5 -【q.self】: 0-100 -【GPA】: previous average of GPA -【q.teacher】: 0-100 --- ## 下載成 csv 後讀檔 - Save as `Rclass.csv`, Importing data into R. ```r # csv (逗號分隔 comma separated value file); csv2 (分號分隔 semicolon separated values) #my_Rclass <- read.csv(file.choose()) ``` - Explore the data and see what you can find