Lab 3

The goal of this lab is to solidify some of the skills and concepts from the first two weeks of the course.

Getting started

  1. Use these steps to navigate to the STA101 version of RStudio using the Duke Container Manager;

  2. Use these steps to download all of the lab 3 files from our Canvas page, upload them to your RStudio files, and move them into an appropriately named folder (lab-3, for instance);

  3. Once lab-3.qmd is where it needs to be, open it, and verify that you can click the “Render” button in RStudio and get a PDF file. See this answer on Ed if you want more guidance here;

  4. Now proceed to complete the exercises in this lab.

Pointers

  • For all visualizations you create, be sure to include informative titles for the plot, axes, and legend;
  • Respond in complete sentences as much as possible;
  • Be sure to observe good code style:
    • There is a line break after each |> in a pipeline or + in a ggplot;
    • There are spaces around = signs;
    • There is a space after each ,;
    • Code is properly indented;
    • Code doesn’t exceed 80 characters in each line, longer lines of code are spread across multiple lines with appropriately placed line breaks (so in the rendered PDF, your code shouldn’t run off the page);
    • Code chunks are labeled, informatively and without spaces.

Packages

In this lab we will work with the tidyverse packages, which is a collection of packages for doing data analysis in a “tidy” way.

Part 1: Nobel laureates

We will now consider a data set on the characteristics of winners (laureates) of a Nobel prize: nobel.csv. You can read it in using the following.

nobel <- read_csv("data/nobel.csv")
Note

Recall that you may have to adjust the file path to match how you have organized your files and folders.

The descriptions of the variables are as follows:

  1. id: ID number
  2. firstname: First name of laureate
  3. surname: Surname
  4. year: Year prize won
  5. category: Category of prize
  6. affiliation: Affiliation of laureate
  7. city: City of laureate in prize year
  8. country: Country of laureate in prize year
  9. born_date: Birth date of laureate
  10. died_date: Death date of laureate
  11. gender: Gender of laureate
  12. born_city: City where laureate was born
  13. born_country: Country where laureate was born
  14. born_country_code: Code of country where laureate was born
  15. died_city: City where laureate died
  16. died_country: Country where laureate died
  17. died_country_code: Code of country where laureate died
  18. overall_motivation: Overall motivation for recognition
  19. share: Number of other winners award is shared with
  20. motivation: Motivation for recognition

In a few cases the name of the city/country changed after laureate was given (e.g. in 1975 Bosnia and Herzegovina was called the Socialist Federative Republic of Yugoslavia). In these cases the variables below reflect a different name than their counterparts without the suffix _original.

  1. born_country_original: Original country where laureate was born
  2. born_city_original: Original city where laureate was born
  3. died_country_original: Original country where laureate died
  4. died_city_original: Original city where laureate died
  5. city_original: Original city where laureate lived at the time of winning the award
  6. country_original: Original country where laureate lived at the time of winning the award

There are some observations in this dataset that we will exclude from our analysis. This code creates a new data frame called nobel_living_science that filters for

  • laureates for whom country is available: !is.na(country);
  • laureates who are people as opposed to organizations, i.e., organizations are denoted with "org" as their gender: gender != "org";
  • laureates who are still alive, i.e., their died_date is NA: is.na(died_date).
  • laureates in the sciences (so not literature or peace).
nobel_living_science <- nobel |>
  filter(!is.na(country) & gender != "org" & is.na(died_date)) |>
  filter(category %in% c("Physics", "Medicine", "Chemistry", "Economics"))

Once you have filtered for these characteristics you are left with a data frame with 228 observations (check this!).

Most living Nobel laureates were based in the US when they won their prizes

… says this Buzzfeed article. Let’s see if that’s true.

First, we’ll create a new variable to identify whether the laureate was in the US when they won their prize. We’ll use the mutate() function for this. The following pipeline mutates the nobel_living_science data frame by adding a new variable called country_us. We use an if statement to create this variable. The first argument in the if_else() function we’re using to write this if statement is the condition we’re testing for. If country is equal to "USA", we set country_us to "USA". If not, we set the country_us to "Other".

nobel_living_science <- nobel_living_science |>
  mutate(
    country_us = if_else(country == "USA", "USA", "Other")
  )

For the following exercises, work with the nobel_living_science data frame you created above.

Exercise 1

Create a faceted bar plot visualizing the relationship between the category of prize and whether the laureate was in the US when they won the Nobel prize. Interpret your visualization, and say a few words about whether the Buzzfeed headline is supported by the data.

  • Your visualization should be faceted by category;
  • For each facet you should have two bars, one for winners in the US and one for Other;
  • Flip the coordinates so the bars are horizontal, not vertical;
  • Make sure everything is clearly labeled!

Exercise 2

Next, let’s investigate, of those US-based Nobel laureates, what proportion were born in other countries.

Create a new variable called born_country_us in nobel_living_science that has the value "USA" if the laureate is born in the US, and "Other" otherwise. How many of the winners are born in the US?

Exercise 3

Add a second variable to your visualization from Exercise 1 based on whether the laureate was born in the US or not.

Create two visualizations with this new variable added:

  • Plot 1: Segmented frequency bar plot
  • Plot 2: Segmented relative frequency bar plot (Hint: Add position = "fill" to geom_bar().)

Here are some instructions that apply to both of these visualizations:

  • Your final visualization should contain a facet for each category.
  • Within each facet, there should be two bars for whether the laureate won the award in the US or not.
  • Each bar should have segments for whether the laureate was born in the US or not.

Which of these visualizations is a better fit for answering the following question: “Do the data appear to support Buzzfeed’s claim that of those US-based Nobel laureates, many were born in other countries?” First, state which plot you’re using to answer the question. Then, answer the question, explaining your reasoning in 1-2 sentences.

Part 2: practicing our coding basics

Exercise 4

The file nsw-crime.csv contains monthly data on all criminal incidents recorded by police in New South Wales, Australia (the state that includes the city of Sydney). So a row in this data set corresponds to a month, and a column corresponds to a crime (murder, dealing cannabis, escaping custody, etc). The data count the number of cases of each crime in each month. In this exercise we will consider the variable offensive_language, an incident of disorderly conduct where the perpetrator…said some things.

Load the crime data into R and create four histogram plots of the variable offensive_language. In each plot, use a different number of histogram bins: 20, 40, 80, and 160. Decide which picture you think best visualizes the distribution of the crime counts, and explain why you think this.

Exercise 5

Load the data about the COVID delta variant (adjusting your file path as needed):

delta <- read_csv("delta.csv")

Consider this sequence of commands:

# part 1

delta |>
  count(vaccine, outcome)

# part 2

delta |>
  count(vaccine, outcome) |>
  group_by(vaccine)

# part 3

delta |>
  count(vaccine, outcome) |>
  group_by(vaccine) |>
  mutate(prop = n / sum(n))

Pick this code apart like a vulture and explain in complete sentences what is going on in each part. The more detail, the better.

Part 3: IMS Exercises

The exercises in this section do not require code. Make sure to answer the questions in full sentences.

Exercise 6

IMS - Chapter 4 exercises, #4: Raise taxes.

Exercise 7

IMS - Chapter 5 exercises, #4: Office productivity.

Exercise 8

IMS - Chapter 5 exercises, #15: Distributions and appropriate statistics.

Exercise 9

IMS - Chapter 5 exercises, #26: NYC marathon winners.

Part 4: waxing poetic

Exercise 10

Describe a situation from your everyday life where you have to make a decision, but it is difficult to make a good decision because you face some uncertainty about the world. Common examples (which you should not now use!) might include deciding when to get food at the dining hall, because you are uncertain about the length of the lines, or deciding when to have a picnic, because you are uncertain about the weather. Next, describe some data (information) that, if you had access to it, would resolve some of the uncertainty you face and help you make a better decision. Describe two versions of the data: an ideal version that would theoretically resolve all of the uncertainty, and an imperfect version that you are more likely to actually encounter in practice. How would you use these data to guide your decision making?

Note

This is basically going to be graded for completion, but I hope some of you will get creative, and I look forward to reading these!

Lastly

Recommend some music for us to listen to while we grade this.

Wrap up

Submitting

Important

Before you proceed, first, make sure that you have updated the document YAML with your name! Then, render your document one last time, for good measure.

To submit your assignment to Gradescope:

  • Go to your Files pane and check the box next to the PDF output of your document (lab-3.pdf).

  • Then, in the Files pane, go to More > Export. This will download the PDF file to your computer. Save it somewhere you can easily locate, e.g., your Downloads folder or your Desktop.

  • Go to the course Canvas page and click on Gradescope and then click on the assignment. You’ll be prompted to submit it.

  • Mark the pages associated with each exercise. All of the papers of your lab should be associated with at least one question (i.e., should be “checked”).

Warning

If you fail to mark the pages associated with an exercise, that exercise won’t be graded. This means, if you fail to mark the pages for all exercises, you will receive a 0 on the assignment. The TAs can’t mark your pages for you, and for them to be able to grade, you must mark them.

Grading

Exercise Points
Exercise 1 8
Exercise 2 6
Exercise 3 8
Exercise 4 6
Exercise 5 6
Exercise 6 2
Exercise 7 2
Exercise 8 5
Exercise 9 4
Exercise 10 3
Total 50