Meet the toolkit

Tyler George

Cornell College
DSC 223 - Spring 2024 Block 7

Warm-up

Announcements

Course homepage

Let’s take a tour!

Collaboration policy

  • Only work that is clearly assigned as team work should be completed collaboratively.

  • Labs will be completed in groups. You should work with each other during class, and sometimes outside of it, to complete the labs. During class, your team will work at a mounted TV and must rotate through each team member.

  • Homework must be submitted individually. You may not directly share answers or code with others; however, you are welcome to discuss the problems in general and ask for advice.

  • Exams must be completed individually. You may not discuss any aspect of the exam with peers. If you have questions, email me, especially if you get stuck on an unusual problem (not a coding error).

Sharing / reusing code policy

  • I am aware that a huge volume of code is available on the web, and many tasks may have solutions posted.

  • Unless explicitly stated otherwise, this course’s policy is that you may make use of any online resources (e.g. RStudio Community, StackOverflow, etc.) but you must explicitly cite where you obtained any code you directly use or use as inspiration in your solution(s).

  • Any recycled code that is discovered and is not explicitly cited will be treated as plagiarism, regardless of source.

Use of generative AI

  • Treat generative AI, such as ChatGPT, the same as other online resources.

  • Guiding principles:

    • (1) Cognitive dimension: Working with AI should not reduce your ability to think clearly. We will practice using AI to facilitate—rather than hinder—learning.

    • (2) Ethical dimension: Students using AI should be transparent about their use and make sure it aligns with academic integrity.

  • ✅ AI tools for code: You may make use of the technology for coding examples on assignments; if you do so, you must explicitly cite where you obtained the code.

  • ❌ AI tools for narrative: Unless instructed otherwise, you may not use generative AI to write narrative on assignments. In general, you may use generative AI as a resource as you complete assignments but not to answer the exercises for you.

Most importantly!

Ask if you’re not sure if something violates a policy!

Six tips for success

  1. Complete all the preparation work before class.

  2. Ask questions.

  3. Do the readings.

  4. Do the lab.

  5. Do the homework.

  6. Don’t procrastinate! There is no time for falling behind on the block!

Course toolkit

Course operation

Doing data science

  • Computing:
    • R
    • RStudio
    • tidyverse
    • Quarto
  • Version control and collaboration:
    • Git
    • GitHub

Toolkit: Computing

Learning goals

By the end of the course, you will be able to…

  • ethically gain data
  • ethically gain insight from data
  • ethically gain insight from data, reproducibly
  • ethically gain insight from data, reproducibly, using modern programming tools and techniques
  • ethically gain insight from data, reproducibly and collaboratively, using modern programming tools and techniques
  • ethically gain insight from data, reproducibly (with literate programming and version control) and collaboratively, using modern programming tools and techniques

Reproducible data analysis

Reproducibility checklist

What does it mean for a data analysis to be “reproducible”?

Short-term goals:

  • Are the tables and figures reproducible from the code and data?
  • Does the code actually do what you think it does?
  • In addition to what was done, is it clear why it was done?

Long-term goals:

  • Can the code be used for other data?
  • Can you extend the code to do other things?

Toolkit for reproducibility

  • Scriptability → R
  • Literate programming (code, narrative, output in one place) → Quarto
  • Version control → Git / GitHub

R and RStudio

R logo

  • R is an open-source statistical programming language
  • R is also an environment for statistical computing and graphics
  • It’s easily extensible with packages

RStudio logo

  • RStudio is a convenient interface for R called an IDE (integrated development environment), e.g. “I write R code in the RStudio IDE”
  • RStudio is not a requirement for programming with R, but it’s very commonly used by R programmers and data scientists

R vs. RStudio

On the left: a car engine. On the right: a car dashboard. The engine is labelled R. The dashboard is labelled RStudio.

R packages

  • Packages: Fundamental units of reproducible R code, including reusable R functions, the documentation that describes how to use them, and sample data

  • As of March 14th, 2024, there are 20,582 R packages available on CRAN (the Comprehensive R Archive Network)

  • We’re going to work with a small (but important) subset of these!
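A minimal sketch of checking which packages are available on your own machine, assuming only base R. The helper name is_installed and the package name notARealPackage12345 are made up for illustration; they are not part of the course materials.

```r
# Hypothetical helper: TRUE if a package can be loaded on this machine.
# requireNamespace() is base R; quietly = TRUE suppresses the message
# that would otherwise appear when the package is missing.
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

# "stats" ships with every R installation, so this should be TRUE.
has_stats <- is_installed("stats")

# A made-up package name, assumed not to exist, so this should be FALSE.
has_fake <- is_installed("notARealPackage12345")

# How many packages are installed locally (far fewer than all of CRAN).
installed_count <- nrow(installed.packages())
```

The count you see will differ from machine to machine, which is one reason the course uses a shared server.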

Tour: R + RStudio

Option 1:

Sit back and enjoy the show!

Option 2:

Clone the corresponding application exercise repo and follow along.

ae-01-meet-the-penguins

Go to the course GitHub organization and clone ae-01-meet-the-penguins-YOUR_USERNAME in RStudio on the server. http://turing.cornellcollege.edu:8787/

Tour recap: R + RStudio

A short list (for now) of R essentials

  • Functions are (most often) verbs, followed by what they will be applied to in parentheses:
do_this(to_this)
do_that(to_this, to_that, with_those)
  • Packages are installed with the install.packages() function and loaded with the library() function, once per session:
install.packages("package_name")
library(package_name)
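The two patterns above, sketched with real functions. This is only an illustration: it uses the stats package because it ships with R, so the install step is shown as a comment rather than run.

```r
# A verb applied to one thing, with an extra argument that tweaks the result.
rounded <- round(3.14159, digits = 2)

# A verb with several named arguments: count from 1 to 10 in steps of 3.
numbers <- seq(from = 1, to = 10, by = 3)

# The result of one function can be handed to another.
average <- mean(numbers)

# Installing a package is only needed when it is not on your machine yet:
# install.packages("package_name")

# Loading makes a package's functions available for the current session.
# stats comes with R, so no install is needed for this example.
library(stats)
middle <- median(numbers)
```

Naming the arguments (digits =, from =, to =, by =) is optional here, but it makes the call read more like a sentence.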

R essentials (continued)

  • Columns (variables) in data frames are accessed with $:
dataframe$var_name
  • Object documentation can be accessed with ?:
?mean
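A small sketch of $ in action. The data frame toy and its values are invented for this example and do not come from any course dataset.

```r
# A tiny, made-up data frame with two columns (variables).
toy <- data.frame(
  name = c("a", "b", "c"),
  height_cm = c(150, 160, 170)
)

# $ pulls out one column as a plain vector.
heights <- toy$height_cm

# That vector can be passed to any function that expects numbers.
mean_height <- mean(heights)

# In an interactive session, ?mean opens the help page for mean().
# It is left as a comment here so the example runs without opening anything.
# ?mean
```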

tidyverse

Hex logos for dplyr, ggplot2, forcats, tibble, readr, stringr, tidyr, and purrr

tidyverse.org

  • The tidyverse is an opinionated collection of R packages designed for data science
  • All packages share an underlying philosophy and a common grammar

Quarto

  • Fully reproducible reports – each time you render, the analysis is run from the beginning
  • Code goes in chunks; narrative goes outside of chunks
  • A visual editor for a familiar, Google Docs-like editing experience
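To make "code goes in chunks, narrative goes outside" concrete, here is a rough sketch that writes a very small Quarto document to a temporary file from R. The title, header text, and file name example.qmd are placeholders chosen for illustration; your course templates will look different.

```r
# Three backticks mark the start and end of a chunk in a Quarto document.
# They are built with strrep() so they don't interfere with this example.
fence <- strrep("`", 3)

qmd_lines <- c(
  "---",                                # YAML header starts
  "title: \"Meet the penguins\"",       # placeholder title
  "format: html",
  "---",                                # YAML header ends
  "",
  "## A header",
  "",
  "Narrative goes outside of chunks.",
  "",
  paste0(fence, "{r}"),                 # an R code chunk starts
  "x <- 2",
  "x * 3",
  fence                                 # the chunk ends
)

# Write the document to a temporary folder so nothing in your project changes.
qmd_path <- file.path(tempdir(), "example.qmd")
writeLines(qmd_lines, qmd_path)

# Count how many R chunks the document contains.
chunk_starts <- sum(grepl("{r}", qmd_lines, fixed = TRUE))
```

Nothing is rendered here; in RStudio you would open the file and click Render, which runs every chunk from top to bottom.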

Tour: Quarto

Option 1:

Sit back and enjoy the show!

Option 2:

ae-01-meet-the-penguins

Go to the course GitHub organization and clone ae-01-meet-the-penguins-YOUR_USERNAME in RStudio on the server. http://turing.cornellcollege.edu:8787/

Tour recap: Quarto

RStudio IDE with a Quarto document, source code on the left and output on the right. Annotated to show the YAML, a link, a header, and a code chunk.

Environments

Important

The environment of your Quarto document is separate from the Console!

Remember this, and expect it to bite you a few times as you’re learning to work with Quarto!

Environments

First, run the following in the console:

x <- 2
x * 3


All looks good, eh?

Then, add the following in an R chunk in your Quarto document:

x * 3


What happens? Why the error?
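One way to see why, sketched in plain R. This only imitates the situation with two environment objects; the names console_env and document_env are made up, and the real mechanics of rendering are not shown.

```r
# Stand-in for the Console: x is defined here.
console_env <- new.env()
assign("x", 2, envir = console_env)

# Evaluating x * 3 where x exists works as expected.
console_result <- eval(quote(x * 3), envir = console_env)

# Stand-in for the Quarto document: a fresh environment that can see
# base R functions such as `*`, but not anything stored in console_env.
document_env <- new.env(parent = baseenv())

# The same code now fails because x was never defined here.
# tryCatch() captures the error message instead of stopping the script.
document_result <- tryCatch(
  eval(quote(x * 3), envir = document_env),
  error = function(e) conditionMessage(e)
)
```

The fix is the same in both cases: define x inside the document itself, in a chunk that comes before the one that uses it.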

How will we use Quarto?

  • Every application exercise, lab, project, etc. is a Quarto document
  • You’ll always have a template Quarto document to start with
  • The amount of scaffolding in the template will decrease over the block

What’s with all the hexes?

Hex logos for many packages