Introduction to Data Science with R

class: title-slide

.bg-text[
# Introduction to Data Science with R
### week.3

<hr />

9月 26, 2019  
謝舒凱
]

---
# 課程資訊

- [課程網頁課綱已更新](https://rlads2019.github.io/lecture/)

- 國慶連假後記名不計分小考預告（記名不計分）

---
# DataCamp 閱讀與練習作業

> 20% 同學免試

---
## 工具提醒

不一定要用 RStudio (e.g. 指令列 `Rscript test.R`)， 但它還可以做很多事 (to be continued...)

- 做 ([可重製 reproducible](http://rmarkdown.rstudio.com/lesson-1.html)、[可調參數 Parameterized](http://rmarkdown.rstudio.com/developer_parameterized_reports.html)、[互動型 interactive](http://rmarkdown.rstudio.com/lesson-14.html)) 筆記(notebook) 與報告 (report)

- 做投影片 (presentation)

- 做網站 (website) 與 web application (using `shiny`)

- 做「數位報表」(dashboard)

- 做專業科學文件 (using `$\LaTeX$`)

---
background-image: url(../img/emo/boredom-small.png)
---
## 各取強項

---
## Learning DS with R 
`對應式`學習意識

**R syntax** `$< >$`
【獲取】Obtaining data, 【整理】Scrubbing data, 【探索】Exploring data, 【建模】Modeling data, 【詮解】Interpreting (and reporting) data.

---
## R 的初體驗

```r
data()	      # browse pre-loaded data sets
data(rivers)	# get this one: "Lengths of Major North American Rivers"
?rivers
head(rivers,10)	# peek at the data set
```

```
##  [1]  735  320  325  392  524  450 1459  135  465  600
```

```r
length(rivers)	# how many rivers were measured?
```

```
## [1] 141
```

```r
summary(rivers) # what are some summary statistics?
```

```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   135.0   310.0   425.0   591.2   680.0  3710.0
```

---
## R 的初體驗

```r
# make a histogram; play around with these parameters
hist(rivers, col="blue", border="white", breaks=25) 
```

![](index_files/figure-html/unnamed-chunk-2-1.png)

---
## In-class Exercise.1  
- 換看看顏色 (用 `colors()`  看 R 認識什麼顏色)

```r
hist(log(rivers), col="sienna", border="white", breaks=25) 
```

![](index_files/figure-html/unnamed-chunk-3-1.png)

---
## Base R and Package
An R package contains `data sets` and specific `functions` to solve specific question.

- 安裝發佈在 CRAN 上的套件:  `install.packages("ggplot2")`
- 安裝在 Github 上的開發套件：

```r
#install.packages("devtools")
#devtools::install_github("shukai/coolR)
```

---
## 比較圖形套件
`plot()` and others

```r
library("ggplot2")
# qplot: ggplot2 中最基本的繪圖函數
qplot(data = iris, Sepal.Length, Petal.Length, color = Species, size = Petal.Width)
```

![](index_files/figure-html/unnamed-chunk-5-1.png)

---
## 入門小秘訣

- `ctrl + l` 清除 console 的顯示內容。
- `rm(list=ls())` 清除 workspace 中的變數。
- 但請注意：R 也可以在終端機執行：對於日後在雲端伺服器工作者，特別是結合**指令列 (command line)** 很重要。
- 隨時知道妳在那裡：`getwd()` and `Set Working Directory`

---
## 變數 (variable)、賦值 (assignment)

- R 在給予變數值時是利用`<-` 而不是其他程式語言中常見的 `=`。（根據 R 官方文件解釋因為在某些狀況是會出問題）。
- 變數命名中，大小寫有所區別。所以 a 與 A 是不同的變數。

```r
a <- 19
a
```

```
## [1] 19
```

---
## Modes and classes of R objects

- 變數命名規則舉例：cannot start with numbers; it will start with a character or underscore; no special character allowed, such as @, #, $, and *.

- 存入變數後，它就是個物件 (object)。兩種最重要的物件屬性 (attribute) 是 `class` 與 `mode` (*numeric, character*, *logical*, *function*).
  - The `mode()` returns the mode of R objects. 表示物件在記憶體中是何種類型存儲的；類別概念以後再談。

```r
mode(rivers)
```

```
## [1] "numeric"
```

---
## 資料類型 (Data type) 與基本運算 (basic arithmetic)

資料類型包含以下幾種，可用 `mode` 函數判斷

- **數值型 (numeric)**：實數（可以寫成整數 integers，小數 floating numners，或 科學記述 scientific notations）

```r
b <- 8.31
mode(b)
```

```
## [1] "numeric"
```

- **字符型 (character)**：文字字串，放入 "" 或 '' 中

```r
c <- 'coding'
mode(c)
```

```
## [1] "character"
```

---
## 資料類型 (Data type) 與基本運算 (basic arithmetic)

- **邏輯型 (logical)**：`TRUE`(T) 和 `FALSE`(F) 兩個值

```r
d <- F
mode(d)
```

```
## [1] "logical"
```

- **複數型 (complex)** :取值包含虛數 `$a+bi$`

```r
e <- 2+3i
mode(e)
```

```
## [1] "complex"
```

---
## NA and NULL

- NA (*missing*)
- NULL (*undefined*)

---
## 資料類型強制轉換 (type coercion)：
  
  - If an R object contains both numeric and logical elements, the mode of that object will be numeric and in that case the logical element automatically gets converted to numeric. 
  - if any R object contains a character element along with both numeric and logical elements, it automatically converts to the character mode.

```r
# R object containing both numeric and logical element
x <- c(2, 4, TRUE, 6, FALSE, 8); mode(x)
```

```
## [1] "numeric"
```

```r
# R object containing character, numeric, and logical elements
y <- c(1,2,TRUE,FALSE,"a"); mode(y)
```

```
## [1] "character"
```

???
趕作業

---
## 資料類型的判斷與轉換

| 類型      | 意義    | 判斷         | 轉換         |
|-----------|---------|--------------|--------------|
| numeric   | 數值    | is.numeric()   | as.numeric()   |
| character | 字符    | is.character() | as.character() |
| logical   | 邏輯    | is.logical()   | as.logical()   |
| complex   | 複數    | is.complex()   | as.complex()   |
| NA        | 缺失    | is.na()        | as.na()        |

```r
is.character(b)
```

```
## [1] FALSE
```

```r
as.character(b)
```

```
## [1] "8.31"
```

---
## 資料結構 Data structure

- 一組（2 個以上）相同或不同**資料類型**的資料元素組合在一起形成**資料結構**.

- R 提供 6 個基本的資料結構：`vector`, `matrix`, `array`, `factor`, `list`, `data frame`.

- 學習重點在於**如何建立 create 與檢索 access**

---
## 向量 Vector
a combination of multiple values (`numeric, character` or `logical`)

### 建立
- `c()` ('concatenate')
- `:`  可產生差距為 1 的等差數列向量。
- `seq()` 可產生等差數列向量，差距值可以自行決定。
- `rep()` 可產生重複數值的向量。

```r
g <- c(1,2,3)
h <- c('me','you')
i <- 1:6
j <- seq(from=1, to=10, by=2)
k <- rep(1:4, times=3, each=2)
```

---
## 向量 Vector
### 檢索 access

- Get a subset of a vector: `my_vec[i]` to get the `ith` elment.
- Calculations with vectors: `max(), min(), length(), sum(), mean(),sd(),var()`, etc.

```r
m <- c(2:10)
m[1]
```

```
## [1] 2
```

```r
m[1:3]
```

```
## [1] 2 3 4
```

---
## 課堂練習.2
Preparing/Obtaining Data

- 資料格式 
 - Comma separated values (`*.csv`)
 - Text file with Tab delimited (`*.txt` or `*.tbl`)
 - MS Excel file (`*.xls` or `*.xlsx`)
 - R data object (`*.RData`)

- 資料來源 
  - Web (下載；網路爬蟲 Scraping and parsing data from the **web** (raw HTML sources)； Interacting with APIs)
  - 資料庫 database

---
## 課堂練習.2 
collaboration: the baby step

- go to [shared doc](https://docs.google.com/spreadsheets/d/1DbifnNulA2ReMjaAhWZkdx72QDWP2WXDw8th5mRXgRA/edit?usp=sharing)
- type in your data
- download it as `csv`, and read the file into R
- quick look at the data and do some preliminary analysis (in groups)

---
## 基本備檔
Data preparation rule

- use the first row as **column names** (which represent *variables*).

- Use the first column as **row names** (which represent *observations*).

- Avoid names with blank spaces. Good example：`person_look`, or `person.look`. Bad example: `person look`

- Avoid names with special symbols (excpet `_`)

- Avoid beggining variable names with a number. Good: `run_100m`, Bad: `100m`.

- R is case sensitive; and row/column name should be unique.

- Avoid blank rows in your data; delete any comments in your file.

- Replace missing values by **NA**.

- Use the four digit format for data. Good: `01/06/1970`, Bad: `01/06/70`.

---
## 檔案欄位說明
specification

連結放在本週 sli.do

-【gender】: 0(trans)-1(femail)-2(male)

-【grade】: 1-2-3-4-5

-【q.self】: 0-100

-【GPA】: previous average of GPA

-【q.teacher】: 0-100

---
## 下載成 csv 後讀檔

- Save as `Rclass.csv`, Importing data into R.

```r
# csv (逗號分隔 comma separated value file); csv2 (分號分隔 semicolon separated values)

#my_Rclass <- read.csv(file.choose())
```

- Explore the data and see what you can find