Introduction to Data Science with R

background-image: url(http://fusionanalyticsworld.com/wp-content/uploads/2016/08/Data-Science-with-R1.jpg?a73fae)
background-position: center
background-size: cover

class: title-slide

.bg-text[
# Introduction to Data Science with R
### week.7

<hr />

10月 31, 2019  
謝舒凱
]

---
## Data Manipulation using `dplyr` and `tidyr`

- .large[data wrangling (= munging)] using `dplyr`

> the process of transforming and mapping data from one "raw" data form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics.)

- .large[data reshaping] : change the layout of data set using `tidyr`

> about changing the way data is organized into rows and columns.

---
## `tidyr`

- a package that **reshape** the layout of tables (into **tidy** format),  particularly designed to work in combination with `magrittr and dplyr` to build a solid data analysis pipeline.

- Four main functions

---
## tidy data

Recall that a data set is called tidy when:

- each column represents a variable
- each row represents an observation (or case)

Having your data in tidy format is crucial for facilitating the tasks of data analysis including data manipulation, modeling and visualization.

---
background-image: url(../img/emo/boredom-small.png)
---
## `tidyr`

```r
#install.packages("tidyr")
library("tidyr")
# use built-in dataset ""USArrest"
```

- Start by subsetting a small data set,

```r
small <- USArrests[c(1, 10, 20, 30,40),]; small
```

```
##                Murder Assault UrbanPop Rape
## Alabama          13.2     236       58 21.2
## Georgia          17.4     211       60 25.8
## Maryland         11.3     300       67 27.8
## New Jersey        7.4     159       89 18.8
## South Carolina   14.4     279       48 22.5
```

- Row names are states, use `cbind()` to add a column named 'state' in the data. which will make the data tidy and the analysis easier.

```r
my_data <- cbind(state = rownames(small), small); my_data
```

```
##                         state Murder Assault UrbanPop Rape
## Alabama               Alabama   13.2     236       58 21.2
## Georgia               Georgia   17.4     211       60 25.8
## Maryland             Maryland   11.3     300       67 27.8
## New Jersey         New Jersey    7.4     159       89 18.8
## South Carolina South Carolina   14.4     279       48 22.5
```

---
## Tidy form

Having your data in tidy format is crucial for facilitating the tasks of data analysis including data manipulation, modeling and visualization.

---
## `gather()` collapse columns into rows

- collapses multiple columns into key-value pairs. 寬格式轉長格式 It produces a 'long' data format from a 'wide' one.
  - `gather(dataframe, key, value, ...)`

```r
my_data2 <- gather(my_data,
                   key = "arrest_attribute",
                   value = "arrest_estimate",
                   -state)

# gather only Murder and Assault
#my_data2 <- gather(my_data,
#                   key = "arrest_attribute",
#                   value = "arrest_estimate",
#                   Murder, Assault)
```

???
Note that, all column names (except state) have been collapsed into a single key column (here “arrest_attribute”). Their values have been put into a value column (here “arrest_estimate”).

---
## `spread()`: spread two columns into multiple columns

相反動作

- does the reverse of `gather()`. It takes two columns (key and value) and spreads into multiple columns. It produces a 'wide' data format from a 'long' one.

```r
my_data3 <- spread(my_data2, 
                   key = "arrest_attribute",
                   value = "arrest_estimate"
                   )
```

---
## `unite()` : Unite multiple columns into one

- takes multiple columns and paste them together into one.
  - `unite(data, col, ..., sep = "_")`

```r
my_data4 <- unite(my_data,
                  col = "Murder_Assault",
                  Murder, Assault,
                  sep = "_")
```

---
## `separate()`: separate one column into multiple

相反動作

-  the reverse of `unite()`. It takes values inside a single character column and separates them into multiple columns.

```r
# Separate the column “Murder_Assault” [in my_data4] into two columns Murder and Assault
separate(my_data4,
         col = "Murder_Assault",
         into = c("Murder", "Assault"),
         sep = "_")
```

---
## Chaining together

- recall that `x %>% f` is equivalent to `\(f(x)\)`

```r
my_data %>% gather(key = "arrest_attribute",
                   value = "arrest_estimate",
                   Murder:UrbanPop) %>%
            unite(col = "attribute_estimate",
                  arrest_attribute, arrest_estimate)
```

```
##             state Rape attribute_estimate
## 1         Alabama 21.2        Murder_13.2
## 2         Georgia 25.8        Murder_17.4
## 3        Maryland 27.8        Murder_11.3
## 4      New Jersey 18.8         Murder_7.4
## 5  South Carolina 22.5        Murder_14.4
## 6         Alabama 21.2        Assault_236
## 7         Georgia 25.8        Assault_211
## 8        Maryland 27.8        Assault_300
## 9      New Jersey 18.8        Assault_159
## 10 South Carolina 22.5        Assault_279
## 11        Alabama 21.2        UrbanPop_58
## 12        Georgia 25.8        UrbanPop_60
## 13       Maryland 27.8        UrbanPop_67
## 14     New Jersey 18.8        UrbanPop_89
## 15 South Carolina 22.5        UrbanPop_48
```

---
# 小抄速記 Cheatsheet

.large[
- [可以善用 cheatsheet](https://rstudio.com/resources/cheatsheets/)

- 徹底弄懂這個 [Data Wrangling with dplyr and tidyr](https://rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf)

]

---
## 期中考加分練習

- 4 人一組。
- 繳交 html 檔到 NTU COOL.

```r
# 檔案放在
# "https://github.com/loperntu/dsR/blob/gh-pages/lectures/07/birth2006.csv" 
# 有 31.1 MB, 也可以直接下載後再讀入
library(tidyverse)
birth <- tbl_df(read.csv("birth2006.csv", row.names = 1))
# or read the r data format (which reduces considerably the size of large files by compression) 
# birth <- readRDS(file = "birth2006.csv")
```

---
## 資料說明

`birth2006.csv`檔，包含了 427323 筆紀錄，與 13 筆變量。變量的意義如下：

- `DOB_MM` and `DOB_WK` : 出生的月份與星期。
 - `DBWT` : 北鼻的重量。
 - `WTGAIN`：母親妊娠期北鼻的增加重量。
 - `SEX` AND `APGAR5`：北鼻的性別和 APGAR5 分數。
 - `DPLURAL` ：胎數
 - `ESTGEST`：預估的懷孕週數。
 - `DMETH_REC`：剖腹產(c-section) 或自然產(vaginal)

---
## 探索分析初體驗

- 經過基本審視資料，可以發現 <span style="color:green; font-weight:bold">在週日(標號1)出生的北鼻比週間還少</span>. 請用程式產生數據說明這個事實，並且做出以適當的圖形。

- 對於上述事實有個猜想是就一般的狀況大部分的生產方式都是採「剖腹產」，而通常「剖腹產」的手術較少排在週日。請用程式產生數據說明這個事實。

- 如果我們要用圖形初步探索以下的變量之間是否有關聯，要如何利用`ggplot2`作圖?

- *胎數與生產方式與北鼻體重*
  - *北鼻的體重成長與出生時重量*