Data Wrangling

.leftcol30[
<center>
<img src="https://github.com/emse-madd-gwu/emse-madd-gwu.github.io/raw/master/images/madd_hex_sticker.png" width=250>
</center>
]

### <svg style="height:0.8em;top:.04em;position:relative;fill:white;" viewBox="0 0 512 512"><path d="M496 128v16a8 8 0 0 1-8 8h-24v12c0 6.627-5.373 12-12 12H60c-6.627 0-12-5.373-12-12v-12H24a8 8 0 0 1-8-8v-16a8 8 0 0 1 4.941-7.392l232-88a7.996 7.996 0 0 1 6.118 0l232 88A8 8 0 0 1 496 128zm-24 304H40c-13.255 0-24 10.745-24 24v16a8 8 0 0 0 8 8h464a8 8 0 0 0 8-8v-16c0-13.255-10.745-24-24-24zM96 192v192H60c-6.627 0-12 5.373-12 12v20h416v-20c0-6.627-5.373-12-12-12h-36V192h-64v192h-64V192h-64v192h-64V192H96z"/></svg> EMSE 6035: Marketing Analytics for Design Decisions
### <svg style="height:0.8em;top:.04em;position:relative;fill:white;" viewBox="0 0 448 512"><path d="M224 256c70.7 0 128-57.3 128-128S294.7 0 224 0 96 57.3 96 128s57.3 128 128 128zm89.6 32h-16.7c-22.2 10.2-46.9 16-72.9 16s-50.6-5.8-72.9-16h-16.7C60.2 288 0 348.2 0 422.4V464c0 26.5 21.5 48 48 48h352c26.5 0 48-21.5 48-48v-41.6c0-74.2-60.2-134.4-134.4-134.4z"/></svg> John Paul Helveston
### <svg style="height:0.8em;top:.04em;position:relative;fill:white;" viewBox="0 0 448 512"><path d="M0 464c0 26.5 21.5 48 48 48h352c26.5 0 48-21.5 48-48V192H0v272zm320-196c0-6.6 5.4-12 12-12h40c6.6 0 12 5.4 12 12v40c0 6.6-5.4 12-12 12h-40c-6.6 0-12-5.4-12-12v-40zm0 128c0-6.6 5.4-12 12-12h40c6.6 0 12 5.4 12 12v40c0 6.6-5.4 12-12 12h-40c-6.6 0-12-5.4-12-12v-40zM192 268c0-6.6 5.4-12 12-12h40c6.6 0 12 5.4 12 12v40c0 6.6-5.4 12-12 12h-40c-6.6 0-12-5.4-12-12v-40zm0 128c0-6.6 5.4-12 12-12h40c6.6 0 12 5.4 12 12v40c0 6.6-5.4 12-12 12h-40c-6.6 0-12-5.4-12-12v-40zM64 268c0-6.6 5.4-12 12-12h40c6.6 0 12 5.4 12 12v40c0 6.6-5.4 12-12 12H76c-6.6 0-12-5.4-12-12v-40zm0 128c0-6.6 5.4-12 12-12h40c6.6 0 12 5.4 12 12v40c0 6.6-5.4 12-12 12H76c-6.6 0-12-5.4-12-12v-40zM400 64h-48V16c0-8.8-7.2-16-16-16h-32c-8.8 0-16 7.2-16 16v48H160V16c0-8.8-7.2-16-16-16h-32c-8.8 0-16 7.2-16 16v48H48C21.5 64 0 85.5 0 112v48h448v-48c0-26.5-21.5-48-48-48z"/></svg> September 08, 2021

---

# Required Packages (check `notes.R` file)

Make sure you have these packages installed:

```r
install.packages(c("tidyverse", "here"))
```

**Remember: you only need to install packages once!**

<br>

Once installed, you'll need to _load_ the libraries every time you open RStudio:

```r
library(tidyverse)
library(here)
```

---

# Week 2: .fancy[Data Wrangling]

### 1. Working with data frames
### 2. Data wrangling with the _tidyverse_

### BREAK

### 3. Project proposals

---

# Week 2: .fancy[Data Wrangling]

### 1. .orange[Working with data frames]
### 2. Data wrangling with the _tidyverse_

### BREAK

### 3. Project proposals

---

# The data frame...in Excel

---

# The data frame...in R

```r
beatles <- tibble(
    firstName   = c("John", "Paul", "Ringo", "George"),
    lastName    = c("Lennon", "McCartney", "Starr", "Harrison"),
    instrument  = c("guitar", "bass", "drums", "guitar"),
    yearOfBirth = c(1940, 1942, 1940, 1943),
    deceased    = c(TRUE, FALSE, FALSE, TRUE)
)

beatles
```

```
#> # A tibble: 4 × 5
#>   firstName lastName  instrument yearOfBirth deceased
#>   <chr>     <chr>     <chr>            <dbl> <lgl>   
#> 1 John      Lennon    guitar            1940 TRUE    
#> 2 Paul      McCartney bass              1942 FALSE   
#> 3 Ringo     Starr     drums             1940 FALSE   
#> 4 George    Harrison  guitar            1943 TRUE
```

---

## **Columns**: _Vectors_ of values (must be same data type)

```r
beatles
```

Extract a column using `$`

```r
beatles$firstName
```

```
#> [1] "John"   "Paul"   "Ringo"  "George"
```

---

## **Rows**: Information about individual observations

Information about _John Lennon_ is in the first row:

```r
beatles[1,]
```

```
#> # A tibble: 1 × 5
#>   firstName lastName instrument yearOfBirth deceased
#>   <chr>     <chr>    <chr>            <dbl> <lgl>   
#> 1 John      Lennon   guitar            1940 TRUE
```

Information about _Paul McCartney_ is in the second row:

```r
beatles[2,]
```

```
#> # A tibble: 1 × 5
#>   firstName lastName  instrument yearOfBirth deceased
#>   <chr>     <chr>     <chr>            <dbl> <lgl>   
#> 1 Paul      McCartney bass              1942 FALSE
```

---

## Take a look at the `beatles` data frame in `notes.R`

---

# Getting data into R

<br>

## 1. Load external packages
## 2. Read in external files (usually a `.csv`* file)

<br>

*csv = "comma-separated values"

---

## Data from an R package

```r
library(ggplot2)
```

See which data frames are available in a package:

```r
data(package = "ggplot2")
```

Find out more about a package data set:

```r
?msleep
```

---

## Back to `notes.R`

---

# Importing an external data file

<br>

- **DO NOT** double-click it!
- **DO NOT** open it in Excel!

Excel can **corrupt** your data!

If you must look at the file in Excel:

- Make a copy
- Open the copy

---

# Steps to importing external data files

## 1. Create a path to the data

```r
library(here)
path_to_data <- here('data', 'data.csv')
path_to_data
```

```
#> [1] "/Users/jhelvy/gh/0gw/MADD/2021-Fall/class/2-data-wrangling/data/data.csv"
```

## 2. Import the data

```r
library(tidyverse)
data <- read_csv(path_to_data)
```

---

## Using the **here** package to make file paths

Called with no arguments, `here()` returns the path to the **root** of your project directory <br>(this is where your `.Rproj` file lives!)

```r
here()
```

```
#> [1] "/Users/jhelvy/gh/0gw/MADD/2021-Fall/class/2-data-wrangling"
```

Given folder and file names, `here()` builds the path to files _inside_ your project directory

```r
path_to_data <- here('data', 'data.csv')
path_to_data
```

```
#> [1] "/Users/jhelvy/gh/0gw/MADD/2021-Fall/class/2-data-wrangling/data/data.csv"
```

---

# Avoid hard-coding file paths!

### (they can break on different computers)

```r
path_to_data <- 'data/data.csv'
path_to_data
```

```
#> [1] "data/data.csv"
```

# 💩💩💩
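Why this breaks: a relative path is resolved against whatever the working directory happens to be. A base R sketch, where a temporary folder (hypothetical, created just for the demo) stands in for "a different computer":

```r
# A relative path is resolved against the current working directory
relative_path <- file.path("data", "data.csv")

original_wd <- getwd()
resolved_before <- file.path(getwd(), relative_path)

# Simulate running the same script from somewhere else
other_dir <- tempfile("other_project_")
dir.create(other_dir)
setwd(other_dir)
resolved_after <- file.path(getwd(), relative_path)

# Restore the original working directory
setwd(original_wd)

# Same relative path, two different files on disk
resolved_before == resolved_after   # FALSE
```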

---

# Back to reading in data

```r
path_to_data <- here('data', 'data.csv')
data <- read_csv(path_to_data)
```

<br>

**Important**: Use `read_csv()` (from **readr**, loaded with the tidyverse) instead of base R's `read.csv()`. It returns a tibble and is usually faster.
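A small sketch of the difference, using a throwaway CSV written to a temporary file (the file contents and the column names `airline` and `air_time` are made up for illustration):

```r
library(readr)

# Write a tiny example file to a temporary location
tmp <- tempfile(fileext = ".csv")
writeLines(c("airline,air_time", "AA,120", "UA,95"), tmp)

# Base R reader: returns a plain data.frame
base_df <- read.csv(tmp)
class(base_df)

# readr reader: returns a tibble
tidy_df <- read_csv(tmp, show_col_types = FALSE)
class(tidy_df)

inherits(base_df, "tbl_df")   # FALSE
inherits(tidy_df, "tbl_df")   # TRUE
```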

---

## Think-Pair-Share

.font90[
1) Use the `here()` and `read_csv()` functions to load the `data.csv` file that is in the `data` folder. Name the data frame object `data`.

2) Use the `data` object to answer the following questions:

- How many rows and columns are in the data frame?
- What type of data is each column? (Just look; no need to type out the answer.)
- Preview the different columns. What do you think this data is about? What might one row represent?
- How many unique airlines are in the data frame?
- What are the earliest and latest observations in the data frame?
- What are the shortest and longest air times for any one flight in the data frame?
]
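A few functions that help with questions like these, sketched on a small made-up tibble rather than the class data (the `toy` object and its columns are assumptions for illustration only):

```r
library(tibble)
library(dplyr)

# Made-up data; not the class data set
toy <- tibble(
    carrier = c("AA", "UA", "AA", "DL"),
    date    = as.Date(c("2021-01-03", "2021-01-01", "2021-02-10", "2021-01-15")),
    minutes = c(120, 95, 240, 60)
)

dim(toy)                  # rows and columns: 4 3
glimpse(toy)              # column names, types, and a preview of values
n_distinct(toy$carrier)   # number of unique values: 3
range(toy$date)           # earliest and latest dates
range(toy$minutes)        # smallest and largest values: 60 240
```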

---

# Week 2: .fancy[Data Wrangling]

### 1. Working with data frames
### 2. .orange[Data wrangling with the _tidyverse_]

### BREAK

### 3. Project proposals

---

### The tidyverse: `stringr` + `dplyr` + `readr` + `ggplot2` + ...

<center>
<img src="images/horst_monsters_tidyverse.jpeg" width="950">
</center>Art by [Allison Horst](https://www.allisonhorst.com/)

---

# 80% of the job is data wrangling

---

## Today: data wrangling with **dplyr**

<center>
<img src="images/horst_monsters_data_wrangling.png" width="600">
</center>Art by [Allison Horst](https://www.allisonhorst.com/)

---

# .center[The main `dplyr` "verbs"]

<br>

"Verb"        | What it does
--------------|--------------------
`select()`    | Select columns by name
`filter()`    | Keep rows that match criteria
`arrange()`   | Sort rows based on column(s)
`mutate()`    | Create new columns 
`summarize()` | Create summary values
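One call per verb, sketched on the `beatles` tibble from earlier (the new column names `ageIn1970` and `meanYear` are made up for illustration):

```r
library(tibble)
library(dplyr)

# Rebuild the example data frame so this block runs on its own
beatles <- tibble(
    firstName   = c("John", "Paul", "Ringo", "George"),
    lastName    = c("Lennon", "McCartney", "Starr", "Harrison"),
    instrument  = c("guitar", "bass", "drums", "guitar"),
    yearOfBirth = c(1940, 1942, 1940, 1943),
    deceased    = c(TRUE, FALSE, FALSE, TRUE)
)

# select(): keep only the named columns
select(beatles, firstName, lastName)

# filter(): keep only rows where the condition is TRUE
filter(beatles, instrument == "guitar")

# arrange(): sort rows by a column (ascending by default)
arrange(beatles, yearOfBirth)

# mutate(): add a new column computed from existing ones
mutate(beatles, ageIn1970 = 1970 - yearOfBirth)

# summarize(): collapse the rows into summary values
summarize(beatles, meanYear = mean(yearOfBirth))
```

Each verb takes a data frame as its first argument and returns a new data frame; `beatles` itself is not modified.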

---

# .center[Core `tidyverse` concept:<br>**Chain functions together with "pipes"**]

# .center[`%>%`]

## Think of the words "...and then..."

```r
data %>% 
  do_something() %>% 
  do_something_else()
```

---

# Think of `%>%` as the words "...and then..."

**Without Pipes** (read from inside-out):

```r
leave_house(get_dressed(get_out_of_bed(wake_up(me))))
```

**With Pipes**:

```r
me %>%
    wake_up %>%
    get_out_of_bed %>%
    get_dressed %>%
    leave_house
```

---