CPP 526: Foundations of Data Science I




This lab offers practice with some basic R functions for summarizing vectors.

I have provided you with a LAB-01 RMD template:




Functions

You will use the following functions for this lab:

names()                 # variable names
head()                  # preview dataset
length()                # vector length (number of elements)
dim(), nrow(), ncol()   # dataset dimensions
sum(), summary()        # summarize numeric vectors
table()                 # summarize factors / character vectors
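
As a quick illustration of what these functions return, here is a minimal sketch on a small made-up data frame. The object toy and its columns are hypothetical and are not part of the Syracuse data.

```r
# Hypothetical toy dataset, invented for illustration only
toy <- data.frame(
  size = c( 1.5, 0.5, 2.0, 1.0 ),
  type = c( "house", "shop", "house", "house" ),
  stringsAsFactors = FALSE
)

names( toy )          # variable names: "size" "type"
head( toy, 2 )        # preview the first two rows
length( toy$size )    # 4 elements in the vector
dim( toy )            # 4 rows and 2 columns
nrow( toy )           # 4
ncol( toy )           # 2
sum( toy$size )       # 5
summary( toy$size )   # min, quartiles, mean, and max
table( toy$type )     # counts per category: house 3, shop 1
```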



Data

This lab uses city tax parcel data from Syracuse, NY. [ Data Dictionary ]



Loading Data Into R

You can load the dataset by including the following code chunk in your file:

URL <- "https://raw.githubusercontent.com/DS4PS/Data-Science-Class/master/DATA/syr_parcels.csv"
dat <- read.csv( URL, stringsAsFactors=FALSE )

Note that referencing variables in R requires both the dataset name and variable name, separated by the $ operator:

summary( dat$acres )

Unlike other stats programs, R lets you have several datasets loaded at the same time. They will often contain variables with the same name (if you create a subset and save it as a new object, for example, you will have two datasets with identical variable names). To avoid conflicts, R forces you to use the dataset$variable convention.
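
The sketch below shows why the dataset name matters. It uses a made-up data frame (toy, with hypothetical columns id and acres) rather than the parcel data, and saves a subset as a second object so that both objects contain a variable called acres.

```r
# Hypothetical toy dataset, invented for illustration only
toy <- data.frame(
  id    = c( 1, 2, 3, 4 ),
  acres = c( 0.2, 1.0, 3.5, 0.4 )
)

# Saving a subset as a new object creates a second dataset
# that also has a variable named acres
big <- toy[ toy$acres > 0.5, ]

sum( toy$acres )   # all four rows: 5.1
sum( big$acres )   # only the two larger rows: 4.5
```

Writing acres on its own would be ambiguous here, which is why each call names the dataset it should read from.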



Lab Instructions

Answer the following questions using the Syracuse parcels dataset and the functions listed.

Your solution should include a written response to the question, as well as the code used to generate the result.


1. How many tax parcels are in Syracuse?

dataset dimensions: dim() or nrow()


2. How many acres of land are in Syracuse?

sum() over the numeric acres vector


3. How many vacant BUILDINGS are there in Syracuse?

sum() over the vacantbuil logical vector


4. What proportion of parcels are tax exempt?

sum() plus length() functions with the logical tax.exempt vector


5. Which neighborhood contains the most tax parcels?

table() with the neighborhood variable


6. Which neighborhood contains the most vacant LOTS?

table() with the neighborhood and land_use variables
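
For questions 5 and 6 it may help to see how table() behaves with one variable versus two. This is a minimal sketch on made-up vectors (hood and use are hypothetical names and values, not the parcel variables). The sort() function is not on the list above and is included only as a convenience.

```r
# Hypothetical toy data, invented for illustration only
hood <- c( "North", "North", "South", "South", "South" )
use  <- c( "Park", "Shop", "Park", "Park", "Shop" )

t1 <- table( hood )            # one variable: counts per category
sort( t1, decreasing=TRUE )    # largest category listed first

t2 <- table( hood, use )       # two variables: cross-tabulation
t2[ , "Park" ]                 # one column: Park counts for each hood
```

With two variables, the rows of the table correspond to the first variable and the columns to the second, so a single column can be pulled out by name.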



HELPFUL HINTS:

When you apply a sum() function to a numeric vector it returns the sum of all elements in the vector.

sum( c(10,20,5) )  # 35

When you apply a sum() function to a logical vector, it will count all of the TRUEs:

x <- c( TRUE, TRUE, FALSE, FALSE, FALSE )
sum( x )                # number of TRUEs
sum( x ) / length( x )  # proportion of TRUEs

R wants to make sure you are aware of missing values, so functions like sum() will return NA (not available) when applied to a vector that contains missing values.

Add the ‘NA remove’ argument (na.rm=TRUE) to functions to ignore missing values:

sum( dat$star, na.rm=TRUE )
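
A minimal sketch of this behavior on a made-up logical vector (x is hypothetical). The is.na() function is not on the lab's function list and is used here only to count the missing values.

```r
# Hypothetical vector with one missing value
x <- c( TRUE, FALSE, NA, TRUE )

sum( x )                # NA, because one element is missing
sum( x, na.rm=TRUE )    # 2, the missing value is ignored
sum( is.na( x ) )       # 1, the number of missing values
length( x )             # 4, length() still counts the NA

# proportion of TRUEs among the non-missing values only
sum( x, na.rm=TRUE ) / sum( !is.na( x ) )
```

Note that dividing by length( x ) instead would count the missing element in the denominator, so decide which proportion you intend to report.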



How to Submit

Use the following instructions to submit your assignment, which may vary depending on your course’s platform.


Knitting to HTML

When you have completed your assignment, click the “Knit” button to render your .RMD file into a .HTML report.


Special Instructions

Perform the following depending on your course’s platform:

  • Canvas: Upload both your .RMD and .HTML files to the appropriate link
  • Blackboard or iCollege: Compress your .RMD and .HTML files in a .ZIP file and upload to the appropriate link

.HTML files are preferred but not allowed by all platforms.


Before You Submit

Remember to ensure the following before submitting your assignment.

  1. Name your files using this format: Lab-##-LastName.rmd and Lab-##-LastName.html
  2. Show the code for your solution and write out your answers in the body text
  3. Do not show excessive output; truncate your output, e.g. with function head()
  4. Follow appropriate styling conventions, e.g. spaces after commas, etc.
  5. Above all, ensure that your conventions are consistent

See Google’s R Style Guide for examples of common conventions.
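
As a small illustration of truncating output, here is a sketch using a made-up vector (x is hypothetical). Printing x directly would show all 100 values, while head() keeps the report short.

```r
# Hypothetical long vector, invented for illustration only
x <- 1:100

head( x )       # first six elements only (the default)
head( x, 3 )    # first three elements: 1 2 3
```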



Common Knitting Issues

.RMD files are knit into .HTML and other formats procedurally, or line by line.

  • An error in code when knitting will halt the process; error messages will tell you the specific line with the error
  • Certain functions like install.packages() or setwd() are bound to cause errors in knitting
  • Altering a dataset or variable in one chunk will affect their use in all later chunks
  • If an object is “not found”, make sure it was created or loaded with library() in a previous chunk

If All Else Fails: If you cannot find and fix the errors in a code chunk that are preventing your document from knitting, add eval = FALSE inside the braces of {r} at the beginning of the chunk so that R does not attempt to evaluate it, that is: {r eval = FALSE}. This will prevent an erroneous chunk of code from halting the knitting process.