class: center, middle, inverse, title-slide

.title[
# Programming Tools in Data Science
]
.subtitle[
## Lecture #3: GitHub
]
.author[
### Samuel Orso
]
.date[
### 28 September 2023
]

---
# GitHub

<img src="images/github.png" width="950" height="450" style="display: block; margin: auto;" />

---
# Motivation

* When working on a project, several people usually work on the same files and folders
* You want to avoid sending each modification by email
* You could use Dropbox, Google Drive and the like, but it is good practice to keep track of modifications and to have a platform to plan and discuss changes

---
# Motivation

GitHub allows you to:

- record the entire history of a file;
- revert to a specific version of the file;
- collaborate on the same platform with other people;
- make changes without modifying the main file and add them once you feel comfortable with them.

---
# Motivation

GitHub will be used to:

- work in groups on projects and homework;
- submit projects and homework;
- develop R packages and websites;
- ...

---
# Ready?

<center> <iframe src="https://giphy.com/embed/h4TdHo3RExSbHd9bOe" width="480" height="425" frameBorder="0" class="giphy-embed" allowFullScreen></iframe><p><a href="https://giphy.com/gifs/cbc-schitts-creek-h4TdHo3RExSbHd9bOe">via GIPHY</a></p> </center>

---
# In fact, what is Git?

<img src="images/git.png" style="width:150px; position:absolute; top:9%; left:40%" />

Git is a **distributed version control system**.

* **distributed**: whenever you instruct Git to share files, Git does not share only the latest version of each file; it distributes **every version** it has recorded for that project.
* **version control system**: many people are used to having *their own version control system*, e.g. by keeping different versions of the same file (`file_v1.R`, `file_v2.R`, ...). This approach is error-prone and inefficient when working on a team project. A version control system instead keeps track of every modification made to your project.
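
---
# In fact, what is Git?
## Distributed, in practice

The "distributed" property can be checked on your own machine. The sketch below is only an illustration, assuming Git is installed: the temporary folder, the file `analysis.R`, its content and the user identity are all hypothetical. It records two versions of a file, copies the project with `git clone`, and counts the versions available in the copy.

```shell
#!/bin/sh
# Minimal sketch: a clone receives the full recorded history, not just the latest files.
# All paths, file names and identities below are made up for the illustration.
set -eu
demo=$(mktemp -d)

# Create a small project and record two versions of one file.
git init -q "$demo/project"
cd "$demo/project"
git config user.name "Demo User"
git config user.email "demo@example.com"

printf 'x <- 1\n' > analysis.R
git add analysis.R
git commit -q -m "Add first version of analysis.R"

printf 'x <- 1\ny <- 2\n' > analysis.R
git commit -q -am "Add y to analysis.R"

# "Share" the project: the copy gets every recorded version.
git clone -q "$demo/project" "$demo/copy"
n_original=$(git -C "$demo/project" rev-list --count HEAD)
n_copy=$(git -C "$demo/copy" rev-list --count HEAD)
echo "commits in original: $n_original, commits in copy: $n_copy"

# An earlier version of the file can be read back from the copy alone.
first_version=$(git -C "$demo/copy" show HEAD~1:analysis.R)
echo "first version of analysis.R in the copy: $first_version"
```

Both commit counts should be equal, which is the point of the first bullet above: the copy can recreate any earlier version without contacting the original.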
---
# Types of VCS

There are three types of version control systems (VCS):

* local
* centralized
* distributed

---
# Types of VCS
## Local

.pull-left[
<img src="images/local_vcs.jpg" width="451" height="300" style="display: block; margin: auto;" />
]

.pull-right[
* One of the simplest and most commonly used VCS
* It keeps patch sets (modifications of a file) locally (on your computer)
* It can recreate the file at any point in time by adding up the patches
]

---
# Types of VCS
## Centralized

.pull-left[
<img src="images/centralized_vcs.png" width="490" height="317" style="display: block; margin: auto;" />
]

.pull-right[
* A single server contains all the versioned files
* Risk of failure
* Risk of database corruption
]

---
# Types of VCS
## Distributed

.pull-left[
<img src="images/dst_vcs.jpg" width="460" height="416" style="display: block; margin: auto;" />
]

.pull-right[
* Store the entire history of files locally
* Sync local changes back to the server
* Allow multiple users and minimize the risks of a centralized VCS
]

---
# Benefits of VCS

* Allow multiple users to collaborate and communicate while working on a project.
* Keep track of the change history of the files (risk mitigation), with the possibility to roll back to a previous version.
* Support different workflows, such as branching and merging (not discussed).

<img src="images/branching.jpeg" width="410" height="250" style="display: block; margin: auto;" />

---
# So Git and GitHub are the same thing?

<center> <iframe src="https://giphy.com/embed/3o6YglDndxKdCNw7q8" width="480" height="478" frameBorder="0" class="giphy-embed" allowFullScreen></iframe><p><a href="https://giphy.com/gifs/nba-basketball-chicago-bulls-3o6YglDndxKdCNw7q8">via GIPHY</a></p> </center>

---
# Git vs GitHub

Git is a distributed VCS, so what is GitHub exactly?

* Git is software...
* ...and GitHub is a web-based platform for software development and version control that uses Git.
* GitHub hosts and shares Git repositories.
* GitHub is not the only service provider

---
# Bitbucket

<img src="images/bitbucket.png" width="2525" style="display: block; margin: auto;" />

---
# GitLab

<img src="images/gitlab.png" width="2476" style="display: block; margin: auto;" />

---
# SourceForge

<img src="images/sourceforge.png" width="2511" style="display: block; margin: auto;" />

---
# Okay to continue?

.center[
<iframe src="https://giphy.com/embed/sG4PBWRjI4GSVCDXEq" width="480" height="480" frameBorder="0" class="giphy-embed" allowFullScreen></iframe><p><a href="https://giphy.com/gifs/nickelodeon-drama-club-sG4PBWRjI4GSVCDXEq">via GIPHY</a></p>
]

---
# File states in Git

A file can be in one of four states: **untracked**, **modified**, **staged** or **committed**.

* **untracked**: a new file that is not tracked by Git (yet);
* **modified**: a tracked file that has been changed but whose changes are not recorded (not committed yet);
* **staged**: a tracked file that has been changed and selected to be saved (committed) into the repository in the next commit snapshot;
* **committed**: a file that is successfully recorded into the (local) repository.

---
# File states in Git

<img src="images/git-basic-workflow-codesweetly.png" width="2560" style="display: block; margin: auto;" />

---
# You can also `.gitignore`

* Some files or folders of your project can be excluded from version control by listing them in a `.gitignore` file
* These files or folders will not be shared with other users

<img src="images/gitignore.png" width="523" style="display: block; margin: auto;" />

---
# GitHub
## Basic workflow

The basic workflow is as follows...

1. Open the RStudio Project connected to your Git(Hub) Repo
2. Work on your computer just like always
3. **Save** your work often just like always
4. When you want to preserve a **snapshot** of your project, you make a "commit"
5. When you have a few commits and want to archive them, you "push" them to the GitHub remote server
6.
   If you decide to work from a different computer, or want to pick up where a collaborator left off, you can "pull" the most up-to-date version of the files from the GitHub remote to your local computer and go back to step 2.

---
class: sydney-blue, center, middle

# Demo on RStudio

---
# GitHub
## New habits

* When you want to preserve a **snapshot** of your project, you make a "commit."
* When you have a few commits and want to archive them, you "push" them to the GitHub remote server.
* If you decide to work from a different computer, or want to pick up where a collaborator left off, you can "pull" the most up-to-date version of the files from the GitHub remote to your local computer.

---
# GitHub
## Commits

Make your commit message as informative and concise as possible.

<img src="images/git_commit.png" width="439" height="250" style="display: block; margin: auto;" />

---
# GitHub
## "pull" before you "push"

Make sure you have the most up-to-date version of your project before working on it, to avoid the headaches of a "merge conflict".

.center[
<iframe src="https://giphy.com/embed/cFkiFMDg3iFoI" width="480" height="269" frameBorder="0" class="giphy-embed" allowFullScreen></iframe><p><a href="https://giphy.com/gifs/git-merge-cFkiFMDg3iFoI">via GIPHY</a></p>
]

---
# GitHub
## Common mistakes (and how to solve them)

* **Commits in the wrong repo**. Nothing seems to work? It's a common mistake. Solution: make sure you are working in the correct RStudio project, the one linked to the intended GitHub repo.
* **Large files error**. GitHub blocks pushes containing files larger than 100 MB. Solution: store large files elsewhere (Dropbox, ...).
* **Conflict (not merge)**. This may happen when two collaborators change the same file at the same time but on different lines of code. One of them pushes the modification to the remote. The second one to push is then rejected because their version of the project is "outdated". Solution: `git pull --rebase`, then push again.
* **Merge conflict**.
  It happens when two collaborators work on the same lines of code at the same time. It is often a problem of miscommunication within groups and lack of organization. Solution: edit the conflicting files directly to resolve the conflict, and discuss potentially conflicting changes within the group before pushing.

---
# GitHub

> git gets easier once you get the basic idea that branches are homeomorphic endofunctors mapping submanifolds of a Hilbert space.
>
> <cite> Isaac Wolkerstorfer </cite>

---
class: sydney-blue, center, middle

# Questions?

.pull-down[

<a href="https://ptds.samorso.ch/">
.white[<svg viewBox="0 0 384 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M369.9 97.9L286 14C277 5 264.8-.1 252.1-.1H48C21.5 0 0 21.5 0 48v416c0 26.5 21.5 48 48 48h288c26.5 0 48-21.5 48-48V131.9c0-12.7-5.1-25-14.1-34zM332.1 128H256V51.9l76.1 76.1zM48 464V48h160v104c0 13.3 10.7 24 24 24h104v288H48z"></path></svg> website]
</a>

<a href="https://github.com/ptds2023/">
.white[<svg viewBox="0 0 496 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3
75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"></path></svg> GitHub]
</a>

]

---
# In-class exercise (10 minutes)

1. Create a GitHub repo for the R Markdown file (.Rmd) you created in the last class.
1. Edit the README.md file and push the .Rmd.
1. In pairs: person A invites person B to work on their repo, then try the following:
    - Repo is up-to-date. Person B modifies the .Rmd and pushes the changes; person A pulls the changes.
    - Repo is up-to-date. Person A modifies the 1st section of the .Rmd, person B modifies the 2nd section (no conflict). No push, no pull in between. Now person A commits and pushes. Then person B tries to commit and push. Try to solve until the repo is up-to-date.
    - Same as the last point, but person B also modifies the 1st section of the .Rmd (conflict).
1. (optional) Complete the exercise "The Basics of GitHub" (you need to register at https://tinyurl.com/ptds2023 and wait for an invitation).
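
---
# In-class exercise
## Command-line sketch

The second scenario of the exercise (edits in different sections, then a rejected push) can be rehearsed without GitHub. The sketch below is not the RStudio workflow shown in the demo. It assumes Git is installed and uses a local bare repository as a stand-in for the GitHub remote. The names `remote.git`, `personA`, `personB`, the content of `report.Rmd` and the user identities are all hypothetical.

```shell
#!/bin/sh
# Sketch of the "no conflict" scenario: A and B edit different sections,
# A pushes first, B is rejected, then B runs `git pull --rebase` and pushes.
# Every name below is made up for the illustration.
set -eu
work=$(mktemp -d)
cd "$work"

# A bare repository plays the role of the GitHub remote.
git init -q --bare remote.git
git -C remote.git symbolic-ref HEAD refs/heads/main

# Person A creates the project and pushes the first version.
git clone -q "$work/remote.git" personA 2>/dev/null
cd "$work/personA"
git symbolic-ref HEAD refs/heads/main   # name the first branch "main"
git config user.name "Person A"
git config user.email "a@example.com"
printf 'Section 1\nalpha\n\n\n\nSection 2\nbeta\n' > report.Rmd
git add report.Rmd
git commit -q -m "Add report.Rmd"
git push -q origin main

# Person B clones the up-to-date repo.
git clone -q "$work/remote.git" "$work/personB"
cd "$work/personB"
git config user.name "Person B"
git config user.email "b@example.com"

# A edits section 1, commits and pushes.
cd "$work/personA"
printf 'Section 1\nalpha, edited by A\n\n\n\nSection 2\nbeta\n' > report.Rmd
git commit -q -am "Edit section 1"
git push -q origin main

# B edits section 2 and commits without pulling first.
cd "$work/personB"
printf 'Section 1\nalpha\n\n\n\nSection 2\nbeta, edited by B\n' > report.Rmd
git commit -q -am "Edit section 2"

# B's local copy is "outdated", so the remote refuses the push.
if git push -q origin main 2>/dev/null; then
  first_push=accepted
else
  first_push=rejected
fi
echo "B's first push was $first_push"

# The remedy named on the "Common mistakes" slide: replay B's commit on top of A's.
git pull -q --rebase origin main
git push -q origin main

# A pulls, and both copies now contain both edits.
cd "$work/personA"
git pull -q --rebase origin main
edits_in_A=$(grep -c 'edited by' "$work/personA/report.Rmd")
edits_in_B=$(grep -c 'edited by' "$work/personB/report.Rmd")
echo "edited lines seen by A: $edits_in_A, by B: $edits_in_B"
```

The two edits are deliberately separated by unchanged lines, so Git can combine them on its own. In the third scenario of the exercise both people change the same lines, and the rebase stops so that the conflicting file can be edited by hand.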