Data Wrangling

Due: Sep 02 by 11:59pm

Weight: This assignment is worth 3% of your final grade.

Purpose: The purpose of this assignment is to get more familiar with R and RStudio and to develop some basic strategies for working with data in R.

Assessment: This assignment is graded using a check system:

  • ✔+ (110%): Responses show phenomenal thought and engagement with the course content. I will not assign these often.
  • ✔ (100%): Responses are thoughtful, well-written, and show engagement with the course content. This is the expected level of performance.
  • ✔− (50%): Responses are hastily composed, too short, and/or only cursorily engage with the course content. This grade signals that you need to improve next time. I will hopefully not assign these often.

Notice that this is essentially a pass/fail system. I’m not grading your writing ability and I’m not counting the number of words you write - I’m looking for thoughtful engagement. One or two sentences is not enough. Write at least a paragraph and show me that you did the readings assigned.

1. Software

If you haven’t yet, go to the Software page and install all the software we’ll need for this course. You’ll need these tools for this assignment.

2. Getting Organized

Download and edit this template when working through this assignment. For now, this is a mostly blank file that you can use to jot down examples and play with code.

3. Readings

Open up a notebook (physical, digital…whatever you take notes in best), and take notes while you go through these readings:

  1. Getting Familiar with the Course: Follow Snoop’s advice and read the entire Course Syllabus (actually read the whole thing). Then review the schedule and make sure to note important upcoming deadlines.
  2. Basics [Optional]: Read through Lessons 1 “Getting Started” and 2 “Data Types & Vectors” in the R4A Primer to get more familiar with the basics.
  3. Data Frames & Data Wrangling: Read through Lessons 3 “Data Frames” and 4 “Data Wrangling” in the R4A Primer to get more familiar with working with data sets in R.

4. Exercises

Posit offers many excellent recipes for implementing lots of common tasks in R. Running through these exercises will help prepare you for class next week.

Pick at least 3 recipes each under “Transform Tables” and “Visualize Data” (so 6 total) and try implementing them yourself. Ideally, try even more.

5. Reflect

Reflect on what you’ve learned while going through these readings and exercises. Is there anything that jumped out at you? Anything you found particularly interesting or confusing? Write at least a paragraph in your hw1.R file, and include at least one question. The teaching team will review the questions we get and will try to answer them either in Slack or in class.

If you’re unsure where to start with a reflection, try filling out this template:

“I used to think ______, now I think ______ 🤔”

6. Submit

To submit your assignment, create a zip file of all the files in your R project folder for this assignment and upload it to the corresponding assignment submission page on Blackboard.


Extra Practice

Not required, but probably helpful, especially if you’re new to R.

Inspect data from other packages

Write R code to install the dslabs package from CRAN, then write code to load the library. Write some code to preview and inspect the movielens data frame that gets loaded when you load the library using some of the techniques we saw in class. For each of the following questions, write code to find your answer and leave a detailed response in a comment:

  • What is this dataset about?
  • How many observations are in the data frame?
  • What is the original source of the data?
  • What type of data is each variable?
  • What are the years of the earliest and most recent observations in the data set?
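If you're not sure which functions to reach for, here is a minimal sketch of some common inspection techniques. It deliberately uses the built-in mtcars data frame rather than movielens (that choice is mine, not part of the exercise), so you will need to adapt the data frame and column names yourself.

```r
# Built-in example data that ships with R; no packages needed.
# (mtcars is a stand-in here, not the data set for this exercise.)
df <- mtcars

# The help page usually describes what a data set is and where it came from:
# ?mtcars

# Preview the first few rows
head(df)

# Compact overview: dimensions, column names, and column types
str(df)

# Number of observations (rows) and variables (columns)
n_obs  <- nrow(df)
n_vars <- ncol(df)

# Type of each variable
var_types <- sapply(df, class)

# Smallest and largest value in a single column
mpg_range <- range(df$mpg)
```

The same calls should work on any data frame once its package is loaded.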

Answer questions about the data

For each of the following questions, write code to find your answer and leave a detailed response in a comment:

  • What is the min, mean, and max rating in the data set?
  • How many observations received the maximum rating?
  • What percentage of total observations received the maximum rating?
  • What is the title of the observation with the longest title (in terms of the number of letters in the title)?
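As a hedged starting point, the sketch below shows one way to answer questions of this shape in base R. The snacks data frame and its name and score columns are made up for illustration; swap in the real data frame and columns when you write your own answers.

```r
# Hypothetical toy data, invented for this example
snacks <- data.frame(
  name  = c("Chips", "Chocolate bar", "Apple", "Trail mix"),
  score = c(3, 5, 4, 5),
  stringsAsFactors = FALSE
)

# Min, mean, and max of a numeric column
score_summary <- c(
  min  = min(snacks$score),
  mean = mean(snacks$score),
  max  = max(snacks$score)
)

# How many rows have the maximum score?
# Comparing gives TRUE/FALSE values; sum() counts the TRUEs.
n_top <- sum(snacks$score == max(snacks$score))

# What percentage of all rows is that?
pct_top <- 100 * n_top / nrow(snacks)

# Which name has the most characters?
# nchar() counts every character, including spaces.
longest_name <- snacks$name[which.max(nchar(snacks$name))]
```

Note that which.max() returns only the first match, so if two names tie for longest you would see just one of them.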

Installing packages from Github: the BRRR library

The vast majority of the time, you will install external packages using the install.packages() function. This installs packages from the Comprehensive R Archive Network (CRAN), where most packages are published. But you can also install packages that are under development or haven’t been published to CRAN yet. Most of the time, these packages are hosted on GitHub - an online platform for sharing code (it’s also where all of the files that make up this website are stored).

To install a package from GitHub, you first need to install the remotes library. Then you can use the remotes::install_github() function to install packages directly from GitHub. To try this out, install the remotes library, then try installing the BRRR package:

remotes::install_github("brooke-watson/BRRR")

Note

Packages on GitHub are in development and often require other packages to work. So if you get an installation error about some other package dependency, try restarting your R session and trying again.
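If you're unsure whether an install actually succeeded, one quick check (a sketch, not the only way) is to ask R whether each package can be found. This runs without an internet connection and does not install anything.

```r
# Packages we expect to have after following the steps above
pkgs <- c("remotes", "BRRR")

# requireNamespace() returns TRUE if the package is installed and can be
# loaded, and FALSE otherwise (quietly = TRUE suppresses the messages).
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

# Named TRUE/FALSE vector: one entry per package
print(installed)
```

Any package that shows FALSE still needs to be installed (or reinstalled after restarting your R session).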

Not sure what this package does? Well, one of the other nice things about packages listed on GitHub is that the authors tend to write detailed descriptions - check out the GitHub page for the BRRR package. Then try using the BRRR::skrrrahh() function with different numeric arguments (turn your volume up). In the #welcome channel on Slack, post your favorite argument to skrrrahh() (mine is 24).