class: middle, inverse .leftcol30[ <center> <img src="https://github.com/emse-madd-gwu/emse-madd-gwu.github.io/raw/master/images/madd_hex_sticker.png" width=250> </center> ] .rightcol70[ # Week 2: .fancy[Data Wrangling] ### <svg style="height:0.8em;top:.04em;position:relative;fill:white;" viewBox="0 0 512 512"><path d="M496 128v16a8 8 0 0 1-8 8h-24v12c0 6.627-5.373 12-12 12H60c-6.627 0-12-5.373-12-12v-12H24a8 8 0 0 1-8-8v-16a8 8 0 0 1 4.941-7.392l232-88a7.996 7.996 0 0 1 6.118 0l232 88A8 8 0 0 1 496 128zm-24 304H40c-13.255 0-24 10.745-24 24v16a8 8 0 0 0 8 8h464a8 8 0 0 0 8-8v-16c0-13.255-10.745-24-24-24zM96 192v192H60c-6.627 0-12 5.373-12 12v20h416v-20c0-6.627-5.373-12-12-12h-36V192h-64v192h-64V192h-64v192h-64V192H96z"/></svg> EMSE 6035: Marketing Analytics for Design Decisions ### <svg style="height:0.8em;top:.04em;position:relative;fill:white;" viewBox="0 0 448 512"><path d="M224 256c70.7 0 128-57.3 128-128S294.7 0 224 0 96 57.3 96 128s57.3 128 128 128zm89.6 32h-16.7c-22.2 10.2-46.9 16-72.9 16s-50.6-5.8-72.9-16h-16.7C60.2 288 0 348.2 0 422.4V464c0 26.5 21.5 48 48 48h352c26.5 0 48-21.5 48-48v-41.6c0-74.2-60.2-134.4-134.4-134.4z"/></svg> John Paul Helveston ### <svg style="height:0.8em;top:.04em;position:relative;fill:white;" viewBox="0 0 448 512"><path d="M0 464c0 26.5 21.5 48 48 48h352c26.5 0 48-21.5 48-48V192H0v272zm320-196c0-6.6 5.4-12 12-12h40c6.6 0 12 5.4 12 12v40c0 6.6-5.4 12-12 12h-40c-6.6 0-12-5.4-12-12v-40zm0 128c0-6.6 5.4-12 12-12h40c6.6 0 12 5.4 12 12v40c0 6.6-5.4 12-12 12h-40c-6.6 0-12-5.4-12-12v-40zM192 268c0-6.6 5.4-12 12-12h40c6.6 0 12 5.4 12 12v40c0 6.6-5.4 12-12 12h-40c-6.6 0-12-5.4-12-12v-40zm0 128c0-6.6 5.4-12 12-12h40c6.6 0 12 5.4 12 12v40c0 6.6-5.4 12-12 12h-40c-6.6 0-12-5.4-12-12v-40zM64 268c0-6.6 5.4-12 12-12h40c6.6 0 12 5.4 12 12v40c0 6.6-5.4 12-12 12H76c-6.6 0-12-5.4-12-12v-40zm0 128c0-6.6 5.4-12 12-12h40c6.6 0 12 5.4 12 12v40c0 6.6-5.4 12-12 12H76c-6.6 0-12-5.4-12-12v-40zM400 64h-48V16c0-8.8-7.2-16-16-16h-32c-8.8 0-16 7.2-16 16v48H160V16c0-8.8-7.2-16-16-16h-32c-8.8 0-16 7.2-16 16v48H48C21.5 64 0 85.5 0 112v48h448v-48c0-26.5-21.5-48-48-48z"/></svg> September 08, 2021 ] --- # Required Packages (check `notes.R` file) Make sure you have these libraries installed: ```r install.packages(c("tidyverse", "here")) ``` **Remember: you only need to install packages once!** -- <br> Once installed, you'll need to _load_ the libraries every time you open RStudio: ```r library(tidyverse) library(here) ``` --- class: inverse, middle # Week 2: .fancy[Data Wrangling] ### 1. Working with data frames ### 2. Data wrangling with the _tidyverse_ ### BREAK ### 3. Project proposals --- class: inverse, middle # Week 2: .fancy[Data Wrangling] ### 1. .orange[Working with data frames] ### 2. Data wrangling with the _tidyverse_ ### BREAK ### 3. Project proposals --- class: center # The data frame...in Excel <center> <img src="images/spreadsheet.png" width=700> </center> --- # The data frame...in R ```r beatles <- tibble( firstName = c("John", "Paul", "Ringo", "George"), lastName = c("Lennon", "McCartney", "Starr", "Harrison"), instrument = c("guitar", "bass", "drums", "guitar"), yearOfBirth = c(1940, 1942, 1940, 1943), deceased = c(TRUE, FALSE, FALSE, TRUE) ) beatles ``` ``` #> # A tibble: 4 × 5 #> firstName lastName instrument yearOfBirth deceased #> <chr> <chr> <chr> <dbl> <lgl> #> 1 John Lennon guitar 1940 TRUE #> 2 Paul McCartney bass 1942 FALSE #> 3 Ringo Starr drums 1940 FALSE #> 4 George Harrison guitar 1943 TRUE ``` --- ## **Columns**: _Vectors_ of values (must be same data type) ```r beatles ``` ``` #> # A tibble: 4 × 5 #> firstName lastName instrument yearOfBirth deceased #> <chr> <chr> <chr> <dbl> <lgl> #> 1 John Lennon guitar 1940 TRUE #> 2 Paul McCartney bass 1942 FALSE #> 3 Ringo Starr drums 1940 FALSE #> 4 George Harrison guitar 1943 TRUE ``` -- Extract a column using `$` ```r beatles$firstName ``` ``` #> [1] "John" "Paul" "Ringo" "George" ``` --- ## **Rows**: Information about individual observations Information about _John Lennon_ is in the first row: ```r beatles[1,] ``` ``` #> # A tibble: 1 × 5 #> firstName lastName instrument yearOfBirth deceased #> <chr> <chr> <chr> <dbl> <lgl> #> 1 John Lennon guitar 1940 TRUE ``` -- Information about _Paul McCartney_ is in the second row: ```r beatles[2,] ``` ``` #> # A tibble: 1 × 5 #> firstName lastName instrument yearOfBirth deceased #> <chr> <chr> <chr> <dbl> <lgl> #> 1 Paul McCartney bass 1942 FALSE ``` --- class: middle, center, inverse ## Take a look at the `beatles` data frame in `notes.R` --- # Getting data into R <br> ## 1. Load external packages ## 2. Read in external files (usually a `.csv`* file) <br> *csv = "comma-separated values" --- ## Data from an R package ```r library(ggplot2) ``` -- See which data frames are available in a package: ```r data(package = "ggplot2") ``` -- Find out more about a package data set: ```r ?msleep ``` --- class: middle, center, inverse ## Back to `notes.R` --- # Importing an external data file <br> .leftcol60[ Note the `data.csv` file in your `data` folder. - **DO NOT** double-click it! - **DO NOT** open it in Excel! Excel can **corrupt** your data! ] -- .rightcol40[ If you **must** open it in Excel: - Make a copy - Open the copy ] --- # Steps to importing external data files -- ## 1. Create a path to the data ```r library(here) *path_to_data <- here('data', 'data.csv') path_to_data ``` ``` #> [1] "/Users/jhelvy/gh/0gw/MADD/2021-Fall/class/2-data-wrangling/data/data.csv" ``` -- ## 2. Import the data ```r library(tidyverse) *data <- read_csv(path_to_data) ``` --- ## Using the **here** package to make file paths The `here()` function builds the path to your **root** to your _working directory_ <br>(this is where your `.Rproj` file lives!) ```r here() ``` ``` #> [1] "/Users/jhelvy/gh/0gw/MADD/2021-Fall/class/2-data-wrangling" ``` -- The `here()` function builds the path to files _inside_ your working directory ```r path_to_data <- here('data', 'data.csv') path_to_data ``` ``` #> [1] "/Users/jhelvy/gh/0gw/MADD/2021-Fall/class/2-data-wrangling/data/data.csv" ``` --- # Avoid hard-coding file paths! ### (they can break on different computers) ```r path_to_data <- 'data/data.csv' path_to_data ``` ``` #> [1] "data/data.csv" ``` # 💩💩💩 --- # Back to reading in data ```r path_to_data <- here('data', 'data.csv') *data <- read_csv(path_to_data) ``` <br> **Important**: Use `read_csv()` instead of `read.csv()` --- class: inverse
10
:
00
## Think-Pair-Share .font90[ 1) Use the `here()` and `read_csv()` functions to load the `data.csv` file that is in the `data` folder. Name the data frame object `data`. 2) Use the `data` object to answer the following questions: - How many rows and columns are in the data frame? - What type of data is each column? (Just look, don't need to type out the answer) - Preview the different columns - what do you think this data is about? What might one row represent? - How many unique airlines are in the data frame? - What is the earliest and latest observation in the data frame? - What is the shortest and longest air time for any one flight in the data frame? ] --- class: inverse, middle # Week 2: .fancy[Data Wrangling] ### 1. Working with data frames ### 2. .orange[Data wrangling with the _tidyverse_] ### BREAK ### 3. Project proposals --- class: center ### The tidyverse: `stringr` + `dplyr` + `readr` + `ggplot2` + ... <center> <img src="images/horst_monsters_tidyverse.jpeg" width="950"> </center>Art by [Allison Horst](https://www.allisonhorst.com/) --- class: center # 80% of the job is data wrangling .noborder[<center> <img src="images/data-wrangle.png" width="1000"> </center>] --- class: center ## Today: data wrangling with **dplyr** <center> <img src="images/horst_monsters_data_wrangling.png" width="600"> </center>Art by [Allison Horst](https://www.allisonhorst.com/) --- # .center[The main `dplyr` "verbs"] <br> "Verb" | What it does --------------|-------------------- `select()` | Select columns by name `filter()` | Keep rows that match criteria `arrange()` | Sort rows based on column(s) `mutate()` | Create new columns `summarize()` | Create summary values --- # .center[Core `tidyverse` concept:<br>**Chain functions together with "pipes"**] # .center[`%>%`] -- ## Think of the words "...and then..." ```r data %>% do_something() %>% do_something_else() ``` --- # Think of `%>%` as the words "...and then..." -- **Without Pipes** (read from inside-out): ```r leave_house(get_dressed(get_out_of_bed(wake_up(me)))) ``` -- **With Pipes**: ```r me %>% wake_up %>% get_out_of_bed %>% get_dressed %>% leave_house ``` --- class: center, middle, inverse # Select columns with `select()` <br> <center> <img src="images/rstudio-cheatsheet-select.png" width="900"> </center> --- # Select columns with `select()` ```r beatles <- tibble( firstName = c("John", "Paul", "Ringo", "George"), lastName = c("Lennon", "McCartney", "Starr", "Harrison"), instrument = c("guitar", "bass", "drums", "guitar"), yearOfBirth = c(1940, 1942, 1940, 1943), deceased = c(TRUE, FALSE, FALSE, TRUE) ) beatles ``` ``` #> # A tibble: 4 × 5 #> firstName lastName instrument yearOfBirth deceased #> <chr> <chr> <chr> <dbl> <lgl> #> 1 John Lennon guitar 1940 TRUE #> 2 Paul McCartney bass 1942 FALSE #> 3 Ringo Starr drums 1940 FALSE #> 4 George Harrison guitar 1943 TRUE ``` --- # Select columns with `select()` Select the columns `firstName` & `lastName` ```r beatles %>% select(firstName, lastName) ``` ``` #> # A tibble: 4 × 2 #> firstName lastName #> <chr> <chr> #> 1 John Lennon #> 2 Paul McCartney #> 3 Ringo Starr #> 4 George Harrison ``` --- # Select columns with `select()` Use the `-` sign to drop columns ```r beatles %>% select(-firstName, -lastName) ``` ``` #> # A tibble: 4 × 3 #> instrument yearOfBirth deceased #> <chr> <dbl> <lgl> #> 1 guitar 1940 TRUE #> 2 bass 1942 FALSE #> 3 drums 1940 FALSE #> 4 guitar 1943 TRUE ``` --- # Select columns with `select()` Select columns based on name criteria: - `ends_with()` = Select columns that end with a character string - `contains()` = Select columns that contain a character string - `matches()` = Select columns that match a regular expression - `one_of()` = Select column names that are from a group of names --- # Select columns with `select()` Select the columns that end with `"Name"`: ```r beatles %>% select(ends_with("Name")) ``` ``` #> # A tibble: 4 × 2 #> firstName lastName #> <chr> <chr> #> 1 John Lennon #> 2 Paul McCartney #> 3 Ringo Starr #> 4 George Harrison ``` --- class: center, middle, inverse # Select rows with `filter()` <br> <center> <img src="images/rstudio-cheatsheet-filter.png" width="900"> </center> --- # Select rows with `filter()` Select the band members born after 1941 ```r #> # A tibble: 4 × 5 #> firstName lastName instrument yearOfBirth deceased #> <chr> <chr> <chr> <dbl> <lgl> #> 1 John Lennon guitar 1940 TRUE *#> 2 Paul McCartney bass 1942 FALSE #> 3 Ringo Starr drums 1940 FALSE *#> 4 George Harrison guitar 1943 TRUE ``` --- # Select rows with `filter()` Select the band members born after 1941 ```r beatles %>% filter(yearOfBirth > 1941) ``` ``` #> # A tibble: 2 × 5 #> firstName lastName instrument yearOfBirth deceased #> <chr> <chr> <chr> <dbl> <lgl> #> 1 Paul McCartney bass 1942 FALSE #> 2 George Harrison guitar 1943 TRUE ``` --- # Select rows with `filter()` Select the band members born after 1941 **& are still living** ```r beatles %>% filter(yearOfBirth > 1941, deceased == FALSE) ``` ``` #> # A tibble: 1 × 5 #> firstName lastName instrument yearOfBirth deceased #> <chr> <chr> <chr> <dbl> <lgl> #> 1 Paul McCartney bass 1942 FALSE ``` -- ```r beatles %>% filter(yearOfBirth > 1941 & deceased == FALSE) ``` ``` #> # A tibble: 1 × 5 #> firstName lastName instrument yearOfBirth deceased #> <chr> <chr> <chr> <dbl> <lgl> #> 1 Paul McCartney bass 1942 FALSE ``` --- # .center[Logic operators for `filter()`] <br> Description | Example ------------|------------ Values greater than 1 | `value > 1` Values greater than or equal to 1 | `value >= 1` Values less than 1 | `value < 1` Values less than or equal to 1 | `value <= 1` Values equal to 1 | `value == 1` Values not equal to 1 | `value != 1` Values in the set c(1, 4) | `value %in% c(1, 4)` --- # Removing missing values Drop all rows where `variable` is `NA` ```r data %>% filter(!is.na(variable)) ``` --- # Combine `filter()` and `select()` Get the **first & last name** of members born after 1941 & are still living ```r beatles %>% filter(yearOfBirth > 1941, deceased == FALSE) %>% select(firstName, lastName) ``` ``` #> # A tibble: 1 × 2 #> firstName lastName #> <chr> <chr> #> 1 Paul McCartney ``` --- class: inverse
10
:
00
## Think-Pair-Share .font90[ 1) Use the `here()` and `read_csv()` functions to load the `data.csv` file that is in the `data` folder. Name the data frame object `data`. 2) Use the `data` object and the `select()` and `filter()` functions to answer the following questions: - Create a new data frame, `flights_fall`, that contains only flights that departed in the fall semester. - Create a new data frame, `flights_dc`, that contains only flights that flew to DC airports (Reagan or Dulles). - Create a new data frame, `flights_dc_carrier`, that contains only flights that flew to DC airports (Reagan or Dulles) and only the columns about the month and airline. - How many unique airlines were flying to DC airports in July? ] --- class: center, middle, inverse ## Create new variables with `mutate()` <br> <center> <img src="images/rstudio-cheatsheet-mutate.png" width="900"> </center> --- class: center background-color: #fff <center> <img src="images/horst_monsters_mutate.jpg" width="700"> </center>Art by [Allison Horst](https://www.allisonhorst.com/) --- # Create new variables with `mutate()` Use the `yearOfBirth` variable to compute the age of each band member ```r beatles %>% mutate(age = 2021 - yearOfBirth) ``` ``` #> # A tibble: 4 × 6 #> firstName lastName instrument yearOfBirth deceased age #> <chr> <chr> <chr> <dbl> <lgl> <dbl> #> 1 John Lennon guitar 1940 TRUE 81 #> 2 Paul McCartney bass 1942 FALSE 79 #> 3 Ringo Starr drums 1940 FALSE 81 #> 4 George Harrison guitar 1943 TRUE 78 ``` --- # You can _immediately_ use new variables ```r beatles %>% mutate( age = 2021 - yearOfBirth, * meanAge = mean(age)) ``` ``` #> # A tibble: 4 × 7 #> firstName lastName instrument yearOfBirth deceased age meanAge #> <chr> <chr> <chr> <dbl> <lgl> <dbl> <dbl> #> 1 John Lennon guitar 1940 TRUE 81 79.8 #> 2 Paul McCartney bass 1942 FALSE 79 79.8 #> 3 Ringo Starr drums 1940 FALSE 81 79.8 #> 4 George Harrison guitar 1943 TRUE 78 79.8 ``` --- # .center[Handling if/else conditions] ### .center[`ifelse(<condition>, <if TRUE>, <else>)`] -- ```r beatles %>% mutate(playsGuitar = ifelse(instrument == "guitar", TRUE, FALSE)) ``` ``` #> # A tibble: 4 × 6 #> firstName lastName instrument yearOfBirth deceased playsGuitar #> <chr> <chr> <chr> <dbl> <lgl> <lgl> #> 1 John Lennon guitar 1940 TRUE TRUE #> 2 Paul McCartney bass 1942 FALSE FALSE #> 3 Ringo Starr drums 1940 FALSE FALSE #> 4 George Harrison guitar 1943 TRUE TRUE ``` --- # Sort data frame with `arrange()` -- Sort `beatles` data frame by year of birth ```r beatles %>% arrange(yearOfBirth) ``` ``` #> # A tibble: 4 × 5 #> firstName lastName instrument yearOfBirth deceased #> <chr> <chr> <chr> <dbl> <lgl> #> 1 John Lennon guitar 1940 TRUE #> 2 Ringo Starr drums 1940 FALSE #> 3 Paul McCartney bass 1942 FALSE #> 4 George Harrison guitar 1943 TRUE ``` --- # Sort data frame with `arrange()` Use the `desc()` function to sort in descending order ```r beatles %>% * arrange(desc(yearOfBirth)) ``` ``` #> # A tibble: 4 × 5 #> firstName lastName instrument yearOfBirth deceased #> <chr> <chr> <chr> <dbl> <lgl> #> 1 George Harrison guitar 1943 TRUE #> 2 Paul McCartney bass 1942 FALSE #> 3 John Lennon guitar 1940 TRUE #> 4 Ringo Starr drums 1940 FALSE ``` --- # Sort rows with `arrange()` Compute the band member age, then sort based on the youngest: ```r beatles %>% mutate(age = 2021 - yearOfBirth) %>% arrange(age) ``` ``` #> # A tibble: 4 × 6 #> firstName lastName instrument yearOfBirth deceased age #> <chr> <chr> <chr> <dbl> <lgl> <dbl> #> 1 George Harrison guitar 1943 TRUE 78 #> 2 Paul McCartney bass 1942 FALSE 79 #> 3 John Lennon guitar 1940 TRUE 81 #> 4 Ringo Starr drums 1940 FALSE 81 ``` --- class: inverse
10
:
00
## Think pair share .font90[ 1) Use the `here()` and `read_csv()` functions to load the `data.csv` file that is in the `data` folder. Name the data frame object `data`. 2) Using the `data` object, create the following new variables: - `speed`: The speed in mph, computed from the time (in minutes) and distance (in miles) variables. - `dep_delay_hour`: Is `TRUE` if the departure delay is greater or equal to one hour, `FALSE` otherwise. 3) Which flight flew the fastest? 4) Remove rows that have `NA` for `air_time` and re-arrange the resulting data frame based on the longest air time and longest flight distance. ] --- class: inverse, center # .fancy[Break]
05
:
00
--- class: inverse, middle # Week 2: .fancy[Data Wrangling] ### 1. Working with data frames ### 2. Data wrangling with the _tidyverse_ ### BREAK ### 3. .orange[Project proposals] --- class: center, middle # [Project Proposal Guidelines](https://madd.seas.gwu.edu/2021-Fall/p1-proposal.html) --- # Proposal Items Item | Description ---- | ------------------------------------ **Abstract** | Product / technology in just a few sentences **Introduction** | Description, picture, background **Market Opportunity** | Identify your customer, competitors, and market size **Product Attributes & Decision Variables** | 2-4 key variables related to product's design and performance **Questions** | Major outstanding questions to be resolved --- # Today .cols3[ ### Market Opportunity - Identify customer - Identify competitors - Identify market size ] -- .cols3[ ### Product Attributes Features your _customer_ cares about ] -- .cols3[ ### Decision Variables Features that _the designer_ cares about ] --- class: middle background-color: #fff # .center[Example: **Folding solar panels**] .leftcol60[ <center> <img src="images/solar.jpg" width=800> </center> ] .rightcol40[ ### Who is your customer? - General public? - Outdoor enthusiasts? - Emergency gear? ### Competitors? - Similar folding panels - Batteries? ] --- class: middle background-color: #fff # .center[Example: **Electric vehicle battery**] .leftcol60[ <center> <img src="images/ev.png" width=100%> </center> ] .rightcol40[ ### Who is your customer? - Car buyers ### Competitors? - Hybrid vehicles? - Efficient gasoline vehicles? ] --- class: center, middle .leftcol[ ## Product Attributes #### Features your _customer_ cares about ] .rightcol[ ## Decision Variables #### Features that _the designer_ cares about ] --- background-image: url(images/solar1.png) background-size: contain --- class: center # Model Relationships Table ([example](https://docs.google.com/spreadsheets/d/1iwMI9cbJjB6J8wghZY6Y_fCOSt7MDsUTnSuuC58_xjU/edit?usp=sharing)) .border[ <center> <img src="images/solar2.png" width=1100> </center> ] --- class: inverse
15
:
00
## Team Proposals 1. Re-arrange tables to sit with your team 2. Discuss & identify your customer & potential competitors 3. Discuss & identify key _Product Attributes_ & _Decision Variables_ 4. Start building out your model relationships table (copy from [this example](https://docs.google.com/spreadsheets/d/1iwMI9cbJjB6J8wghZY6Y_fCOSt7MDsUTnSuuC58_xjU/edit?usp=sharing)) ### Suggestions - You may want to start with simple bullet lists - Start with more items rather than fewer (can always cut back later)