Exploratory Data Analysis (1)

Introducing Data Science with R, week 6 2019-10-17


在了解基本的 R語法之後,我們要進入整頓資料 (data wrangling/transformation)(進而理解資料)的學習階段。

首先你會拿到資料,但是不整齊 (non-tidy) 的資料讓妳的人生被浪費。

幸福的人都很類似,不幸的人則各有各的不幸。 Tidy datasets are all alike, but every messy dataset is messy in its own way.” –– Hadley Wickham

那整齊的資料是什麼意思?

Tidy data

三原則

  1. 每個變數都必須有自己的資料欄 Each variable forms a column.
  2. 每個觀察都必須有自己的資料列 Each observation forms a row.
  3. 每個值都必須有自己的格子 Each value must have its own cell.

Tidy data processing

dplyr

Basics

Data manipulation 五個基本動作:filter(), select(), arrange(), mutate(), summarise()。 注意:直行(VAR)橫列 (OBS)

  • 篩選 filter(): 按給定的邏輯判斷,篩選出符合要求的 OBS, 類似於 subset()
  • 選擇 select() : 用 VAR 作參數來選擇 OBS。
  • 排列 arrange(): 按給定的 VAR 依次對 OBS 進行排序。類似於 order()
  • 增行 mutate(): 對已有 VAR 進行運算並添加為新的 VAR。類似於 transform()
  • 摘要 summarise(): 對data frame 調用其它函數進行 summarise, 並回傳一維結果。

First argument is a data frame, and subsequent arguments say what to do with data frame.

# load packages
suppressMessages(library(dplyr))
# install.packages(c("hflights", "Lahman"))
library(hflights)

# explore data: flights departing from two Houston airports in 2011
data(hflights)
head(hflights,5)
  • tbl_df() creates a 「local data frame」, which is simply a wrapper for a data frame that prints nicely.
# convert to local data frame
flights <- tbl_df(hflights)
# alternative: head(data.frame(flights),3)

# printing only shows 10 rows and as many columns as can fit on your screen
flights

# you can specify that you want to see more rows
#print(flights, n=20)

# convert to a normal data frame to see all of the columns
#data.frame(head(flights))

filter (keep rows marching criteria) : 篩選觀察

  • Base R approach to filtering forces you to repeat the data frame’s name, and dplyr approach is simpler to write and read.
  • filter(df, 回傳符合邏輯條件的 rows)
# base R approach to view all flights on January 1
#flights[flights$Month==1 & flights$DayofMonth==1, ]

# dplyr approach
# note: you can use comma or ampersand to represent AND condition
filter(flights, Month==1 & DayofMonth==1)
# use pipe for OR condition
filter(flights, UniqueCarrier=="AA" | UniqueCarrier=="UA")

# you can also use %in% operator
#filter(flights, UniqueCarrier %in% c("AA", "UA"))

select: Pick columns by name 選取變量

  • Base R approach is awkward to type and to read, dplyr approach uses similar syntax to filter.
# base R approach to select DepTime, ArrTime, and FlightNum columns
#flights[, c("DepTime", "ArrTime", "FlightNum")]

# dplyr approach
select(flights, DepTime, ArrTime, FlightNum)
# use colon to select multiple contiguous columns, and use `contains` to match columns by name
# note: `starts_with`, `ends_with`, and `matches` (for regular expressions) can also be used to match columns by name
# 或者使用 - 來排除某列
select(flights, Year:DayofMonth, contains("Taxi"), contains("Delay"))

Chaining/Pipelining - Usual way to perform multiple operations in one line is by nesting. - Can write commands in a natural order by using the %>% infix operator (which can be pronounced as 「then」) - Chaining increases readability significantly when there are many commands (Operator is automatically imported from the magrittr package)

# nesting method to select UniqueCarrier and DepDelay columns and filter for delays over 60 minutes
filter(select(flights, UniqueCarrier, DepDelay), DepDelay > 60)

# chaining method
flights %>%
    select(UniqueCarrier, DepDelay) %>%
    filter(DepDelay > 60)

再一個例子

# create two vectors and calculate Euclidian distance between them
x1 <- 1:5; x2 <- 2:6
sqrt(sum((x1-x2)^2))

# chaining method
(x1-x2)^2 %>% 
  sum() %>% 
  sqrt()
  • rename() 重命名變量 variable names rename(tbl, newname = oldname,...)

arrange(): 排序觀察 Reorder rows

  • arrange(依 variable 或是 variables 來排序), desc() 是倒序。
# base R approach to select UniqueCarrier and DepDelay columns and sort by DepDelay
#flights[order(flights$DepDelay), c("UniqueCarrier", "DepDelay")]

# dplyr approach
flights %>%
    select(UniqueCarrier, DepDelay) %>%
    arrange(DepDelay)

# use `desc` for descending
# flights %>%
#     select(UniqueCarrier, DepDelay) %>%
#     arrange(desc(DepDelay))
# 

mutate(): Add new variables 擴展新 variable(在後面)

  • Create new variables that are functions of existing variables, 類似於transform()
# base R approach to create a new variable Speed (in mph)
#flights$Speed <- flights$Distance / flights$AirTime*60
#flights[, c("Distance", "AirTime", "Speed")]

# dplyr approach (prints the new variable but does not store it)
flights %>%
    select(Distance, AirTime) %>%
    mutate(Speed = Distance/AirTime*60)
# store the new variable
flights <- flights %>% 
  mutate(Speed = Distance/AirTime*60)

summarise(): Reduce variables to values

  • Primarily useful with data that has been grouped by one or more variables
  • summarise() uses the provided aggregation function to summarise each group

Group Data

  • 以上功能加上分組操作group_by()這個概念結合起來時非常強大!
# base R approaches to calculate the average arrival delay to each destination
# head(with(
#   flights, 
#   tapply(ArrDelay, Dest, mean, na.rm=TRUE))
#   )
# head(aggregate(ArrDelay ~ Dest, flights, mean))


# dplyr approach: create a table grouped by Dest, and then summarise each group by taking the mean of ArrDelay
flights %>%
    group_by(Dest) %>%
    summarise(avg_delay = mean(ArrDelay, na.rm=TRUE))
  • summarise_each() allows you to apply the same summary function to multiple columns at once (Note: mutate_each() is also available)
# for each carrier, calculate the percentage of flights cancelled or diverted
flights %>%
    group_by(UniqueCarrier) %>%
    summarise_each(funs(mean), Cancelled, Diverted)
# for each carrier, calculate the minimum and maximum arrival and departure delays
flights %>%
    group_by(UniqueCarrier) %>%
    summarise_each(funs(min(., na.rm=TRUE), max(., na.rm=TRUE)), matches("Delay"))
  • Helper function n() counts the number of rows in a group
  • Helper function n_distinct(vector) counts the number of unique items in that vector
# for each day of the year, count the total number of flights and sort in descending order
flights %>%
    group_by(Month, DayofMonth) %>%
    summarise(flight_count = n()) %>%
    arrange(desc(flight_count))
# for each destination, count the total number of flights and the number of distinct planes that flew there
flights %>%
    group_by(Dest) %>%
    summarise(flight_count = n(), plane_count = n_distinct(TailNum))
  • Grouping can sometimes be useful without summarising
# for each destination, show the number of cancelled and not cancelled flights
flights %>%
    group_by(Dest) %>%
    select(Cancelled) %>%
    table() %>%
    head()
  • Summarise function takes n inputs and returns 1 value
  • Window function takes n inputs and returns n values. Includes ranking and ordering functions (like min_rank()), offset functions (lead() and lag()), and cumulative aggregates (like cummean()).

ggplot2

#install.packages('gtrendsR')
library(gtrendsR)
trends <- gtrends(c("Nerds", "Smarties"), geo ="CA")
plot(trends)

#install.packages('plotly')
library(plotly)
p <-plot(trends)

ggplotly(p)

ggplot2 的基本文法

  • gg 代表 grammar of graphics

    • (data, aesthetics) + geometry

      • data: a data frame
      • aesthetics: used to indicate x and y variables, also used to control the color, size, shape of points, heights of bars, etc.
      • geometry: corresponds to the type of graphics (histogram, box plot,…)
library(ggplot2)
gg <- ggplot(diamonds, aes(price, carat)) +
  geom_point(color = "brown4") # scatter plot; size=1.5, shape=18

gg

  • Build plots steps by steps
gg <- gg + ggtitle("Diamond carat and price")

Rapid Data Exploration with dplyr and ggplot2

diamonds %>%                      # Start with the 'diamonds' dataset
  filter(cut == "Ideal") %>%      # Then, filter down to rows where cut == Ideal
  ggplot(aes(x=color,y=price)) +  # Then, plot using ggplot
    geom_boxplot()                #  with and create a boxplot

再回來玩 gtrendsR

library(gtrendsR)

# https://rdrr.io/cran/gtrendsR/man/gtrends.html

#define the keywords
keywords=c("Paris","New York","Barcelona")
#set the geographic area: DE = Germany
country=c('TW')
#set the time window
time=("2010-01-01 2018-08-27")
#set channels 
channel='web'



trends = gtrends(keywords, gprop =channel,geo=country, time = time )
#select only interst over time 
time_trend=trends$interest_over_time
head(time_trend)


plot<-ggplot(data=time_trend, aes(x=date, y=hits,group=keyword,col=keyword))+
  geom_line()+xlab('Time')+ylab('Relative Interest')+ theme_bw()+
  theme(legend.title = element_blank(),legend.position="bottom",legend.text=element_text(size=12))+ggtitle("Google Search Volume")
plot

中文資料參考

http://molecular-service-science.com/2013/11/27/r-ggplot-tutorial-1/

http://molecular-service-science.com/2014/01/23/r-ggplot-tutorial-2/